Preface



On behalf of the program committee, we were pleased to present this year’s 
program for ACSAC: Asia-Pacific Computer Systems Architecture Conference.

Now in its ninth year, ACSAC continues to provide an excellent forum for 
researchers, educators and practitioners to come to the Asia-Pacific region to 
exchange ideas on the latest developments in computer systems architecture. 
This year, the paper submission and review processes were semiautomated using 
the free version of CyberChair. We received 152 submissions, the largest number 
ever. Each paper was assigned at least three, mostly four, and in a few cases even 
five committee members for review. All of the papers were reviewed in a two- 
month period, during which the program chairs regularly monitored the progress 
of the review process. When reviewers claimed inadequate expertise, additional 
reviewers were solicited. In the end, we received a total of 594 reviews (3.9 per 
paper) from committee members as well as 248 co-reviewers whose names are
acknowledged in the proceedings. We would like to thank all of them for their 
time and effort in providing us with such timely and high-quality reviews, some 
of them on extremely short notice. 

After all of the reviews were received, there was a one-week electronic
program committee meeting between May 14 and May 21. All of the papers were
reviewed and discussed by the program committee, and the final set of papers
was selected. Program committee members were allowed to submit papers, but
their papers were handled separately. Each of their papers was assigned to at 
least four committee members and reviewed under the same rigorous review 
process. The program committee accepted 7 out of 11 “PC” submissions. In the 
end, the program committee selected a total of 45 papers for this year’s program 
with an acceptance rate close to 30%. Unfortunately, many fine papers could not 
be accommodated in this year’s program because of our schedule. 

In addition to the contributed papers, this year’s program included invited 
presentations. We were very pleased that three distinguished experts accepted 
our invitation to share their views on various aspects of computer systems
architecture design: James E. Smith (University of Wisconsin-Madison, USA) on
Some Real Observations on Virtual Machines, Jesse Z. Fang (Intel, USA) on
A Generation Ahead of Microprocessor: Where Software Can Drive uArchitecture
to?, and, finally, Guojie Li (Chinese Academy of Sciences, China) on Make
Computers Cheaper and Simpler.

On behalf of the program committee, we thank all of the authors for their 
submissions, and the authors of the accepted papers for their cooperation in 
getting their final versions ready in time for the conference. We would also like 
to thank the Web Chair, Lian Li, for installing and maintaining CyberChair, and 
the Local Arrangements Chair, Wenguang Chen, for publicizing this conference. 

Finally, we want to acknowledge the outstanding work of this year’s pro- 
gram committee. We would like to thank them for their dedication and effort 
in providing timely and thorough reviews for the largest number of submissions
ever in our conference history, and their contribution during the paper selection 
process. It was a great pleasure working with these esteemed members of our 
community. Without their extraordinary effort and commitment, it would have 
been impossible to put such an excellent program together in a timely fashion. 
We also want to thank all our sponsors for their support of this event. Last, 
but not least, we would like to thank the General Chair, Weimin Zheng, for his
advice and support to the program committee and his administrative support 
for all of the local arrangements. 
Some Real Observations on Virtual Machines



James E. Smith 



Department of Electrical and Computing Engineering 
University of Wisconsin-Madison 
jes@ece.wisc.edu



Abstract. Virtual machines can enhance computer systems in a number 
of ways, including improved security, flexibility, fault tolerance, power ef- 
ficiency, and performance. Virtualization can be done at the system level 
and the process level. Virtual machines can support high level languages 
as in Java, or can be implemented using a low level co-designed para- 
digm as in the Transmeta Crusoe. This talk will survey the spectrum 
of virtual machines and discuss important design problems and research 
issues. Special attention will be given to co-designed VMs and their ap- 
plication to performance- and power-efficient microprocessor design. 




Replica Victim Caching to Improve Reliability 
of In-Cache Replication 



W. Zhang 
Abstract. Soft error conscious cache design is a necessity for reliable
computing. ECC or parity-based integrity checking techniques in use
today either compromise performance for reliability or vice versa. The
recently proposed ICR (In-Cache Replication) scheme can enhance data
reliability with minimal impact on performance; however, it can only
exploit a limited space for replication and thus cannot solve the conflicts
between the replicas and the primary data without compromising either
performance or reliability. This paper proposes to add a small cache,
called the replica victim cache, to solve this dilemma effectively. Our
experimental results show that a replica victim cache of 4 entries can increase
the reliability of L1 data caches 21.7% more than ICR without impacting
performance, and the area overhead is within 10%.



1 Introduction and Motivation 

Soft errors are unintended transitions of logic states caused by external radiation
such as alpha particle and cosmic ray strikes. Recent studies [4, 6, 5, 9] indicate
that soft errors are responsible for a large percentage of computation failures.
In current microprocessors, over 60% of the on-chip real estate is taken by caches,
making them more susceptible to external radiation. The soft errors in cache
memories can easily propagate into the processor registers and other memory
elements, resulting in catastrophic consequences for the execution. Therefore,
soft error tolerant cache design is becoming increasingly important for
failure-free computation.

Information redundancy is the main technique to improve data integrity.
Currently, the popular information redundancy schemes for memories are either
byte parity (one parity bit per 8 bits of data) [1] or a single error
correcting-double error detecting (SEC-DED) code (ECC) [2,3]. However, both
schemes have deficiencies. Briefly, a parity check can only detect single-bit
errors. While SEC-DED based schemes can correct single-bit errors and detect
two-bit errors, they also increase the access latency of the L1 cache and are
thus not suitable for high-end processors clocked over 1GHz [7]. Recently, an
approach called ICR (In-Cache Replication) has been proposed to enhance the
reliability of the L1 data cache for high-performance processors [7]. The idea
of ICR is to exploit “dead” blocks in the L1 data cache to store the replicas
for “hot” blocks so that a large percentage of read hits in the L1 can find
their replicas in the same cache, which
can be used to detect and correct single-bit and/or multiple-bit errors. While the
ICR approach can achieve a better tradeoff between performance and reliability
than parity-only or ECC-only protection, it can only exploit a limited space
(namely the “dead” blocks in the data cache) for replication. In addition, since
the replica and the primary data are stored in the same cache, they inevitably
conflict with each other. The current policy adopted by ICR [7] is to give
priority to the primary data to minimize the impact on performance. In other
words, data reliability is compromised. As illustrated in [7], 10% to 35%
of the data in the L1 data cache is not protected (i.e., has no replicas) by ICR
schemes, which may cause severe consequences in computation. ICR alone is thus
not suitable for applications that require high reliability or operate in highly
noisy environments.
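
The limitation of byte parity described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names and the even-parity convention are assumptions made for illustration. It shows that one parity bit per byte flags a single flipped bit but misses a double flip, whereas comparing against a clean replica (the mechanism ICR relies on) still exposes the mismatch.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit of an 8-bit value: 1 if the count of set bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def parity_detects_error(stored: int, stored_parity: int) -> bool:
    """True if the recomputed parity disagrees with the parity saved at write time."""
    return parity_bit(stored) != stored_parity


original = 0b10110010               # value written to the cache
saved_parity = parity_bit(original)
replica = original                  # hypothetical clean copy kept elsewhere

single_flip = original ^ 0b00000100  # one bit upset
double_flip = original ^ 0b00010100  # two bits upset in the same byte

print(parity_detects_error(single_flip, saved_parity))  # True
print(parity_detects_error(double_flip, saved_parity))  # False: parity is blind here
print(double_flip != replica)                           # True: replica comparison catches it
```

The sketch models only detection; which copy to trust after a mismatch is a separate policy question that the schemes in Section 3 address.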

This paper proposes a novel scheme to enhance the reliability of ICR further
by adding a small fully associative cache to store the replica victims, which is
called the replica victim cache in this paper. Unlike the victim cache proposed
by Jouppi [10] for reducing the conflict misses of a direct-mapped cache without
In-Cache Replication, the proposed replica victim cache stores the replica
victims, i.e., replicas that conflict with the primary data or other replicas in
the primary data cache, to enhance the reliability of the ICR approaches
significantly without compromising performance. Moreover, since the replica is
used only to improve data integrity, the replica victim cache does not need to
swap the replica with the primary data when accessed. In contrast, the traditional
victim cache stores data (i.e., victims) that differ from the data in the primary
cache, and the victims need to be swapped into the L1 cache in the case of a miss
in the L1 data cache that hits in the victim cache [10]. The paper examines the
following problems.

1. How does reliability, in terms of loads with replica (see the definition in
Section 4), relate to the size of the replica victim cache and to the size and
associativity of the primary cache? By how much can loads with replica be
increased by the addition of a replica victim cache?

2. How can the replicas in either the primary cache or the replica victim
cache be exploited to provide different levels of reliability and performance?

3. What is the error detection and correction behavior of different replica-based 
schemes under different soft error rates? 

We implemented the proposed replica victim caching schemes by modifying
SimpleScalar 3.0 [14]. The error injection experiments are based on a random
injection model [5]. Our experimental results reveal that a replica victim cache
of 4 entries can increase the reliability of ICR by 21.7% without impacting
performance, and that its area overhead is less than 10% compared to most L1 data
caches.

The rest of the paper is organized as follows. Section 2 introduces the
background information about In-Cache Replication and its limitations. Section 3
describes the architecture of replica victim caching and different schemes to
exploit the replica victim lines for improving data reliability. The evaluation
methodology is given in Section 4. Section 5 presents the experimental results.
Finally, Section 6 summarizes the contributions of this paper and identifies
directions for future research.

2 Background and Motivation 

A recent paper [7] presents ICR (In-Cache Replication) for enhancing data cache
reliability without significant impact on performance. The idea of ICR is to
replicate hot cache lines in active use within the dead cache lines. The mapping
between the primary data and the replica is controlled by a simple function.
Two straightforward mapping functions are studied in [7], namely, the vertical
mapping (replication across sets) and the horizontal mapping (replication within
the ways of a set), as shown in Figure 1. The dead cache lines are predicted by
a time-based dead block predictor, which is shown to be highly accurate and to
have low hardware overhead [8]. The ICR approach can be used with either parity
or ECC based schemes, and there is a large design space to explore, including
what, when and where to replicate the data [7]. The design decisions adopted in
this paper are presented in Table 1.
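
The two mapping functions can be sketched in a few lines of Python. This is an illustration, not the implementation from [7]: the function names are hypothetical, and the modulo wrap-around in the vertical mapping is an assumption (the design decisions only state that the replica is placed N/2 sets away from the primary location, N being the number of sets).

```python
def num_sets(cache_bytes: int, ways: int, block_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    return cache_bytes // (ways * block_bytes)


def set_index(address: int, block_bytes: int, sets: int) -> int:
    """Set selected by a byte address (simple modulo indexing)."""
    return (address // block_bytes) % sets


def vertical_replica_set(primary_set: int, sets: int) -> int:
    """Vertical mapping: the replica lives N/2 sets away from the primary.
    ASSUMPTION: the distance wraps around modulo the number of sets."""
    return (primary_set + sets // 2) % sets


def horizontal_replica_set(primary_set: int) -> int:
    """Horizontal mapping: the replica stays in another way of the same set."""
    return primary_set


# 16KB, 4-way, 32-byte blocks (the L1 data cache of the base configuration)
SETS = num_sets(16 * 1024, ways=4, block_bytes=32)
primary = set_index(0x1F40, block_bytes=32, sets=SETS)
print(SETS, primary, vertical_replica_set(primary, SETS))  # 128 122 58
```

With the vertical mapping a replica competes for space with the primary data of a different set, which is one source of the replica conflicts discussed in the rest of this section.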

The results in [7] demonstrate that the ICR schemes can achieve better
performance and/or reliability than the schemes that use ECC or parity check
alone; however, they can only achieve modest reliability improvement due to the
limited space they can exploit. Since each L1 data cache has a fixed number of
cache lines, and each cache line can be used either to store the primary data to
benefit performance or to store the replicas to enhance data reliability, ICR
approaches
Fig. 1. Cache line replication: (a) vertical replication, (b) horizontal replication [7].



Table 1. Default implementation strategies regarding cache line replication.

Question | Answer
When do we replicate? | Only during writes
Where do we replicate? | N/2 sets away from the primary location
How many times do we attempt replication? | Once for each write
How many replicas do we create? | At most 1 per line
How do we protect unreplicated cache lines? | Using parity
How do we protect replicated cache lines? | Using parity or parallel comparison between the primary data and its replica, depending on the scheme we use (see Section 3)
How do we place a replica in a set? | Candidates include both dead lines and replicas (we check the dead lines first)
What needs to be done upon a replacement? | We remove replica as well

give priority to performance by using several strategies. First, the dead block
predictor is designed to be very accurate so that the blocks predicted dead will
most likely not be accessed in the near future. Otherwise, the primary data has
to be loaded from the higher levels of the memory hierarchy, resulting in a
performance penalty. Second, in case of a conflict between the primary data and
the replica, ICR gives higher priority to the primary data and the replica is
simply discarded. For instance, when primary data is written to a cache block
that stores a replica for other primary data, the replica is overwritten by the
incoming primary data. As a result, the ICR approach can only keep the replicas
that do not conflict with the primary data, resulting in moderate data
reliability improvement. The experiments in [7] also reveal that 10% to 35% of
load hits in the L1 cannot find their replicas and that ICR schemes have more
than 20% unrecoverable loads under intense error injection. With the trend of
increasing soft error rates, the reliability of ICR approaches needs to be
improved further, especially for applications that demand very high data
reliability or operate in highly noisy environments.

3 Replica Victim Cache 

One straightforward way to enhance the data reliability of ICR further is
to make more replicas in the L1 data cache by giving priority to replicas in
case of conflicts with the primary data. This approach, however, inevitably
degrades performance and thus is not acceptable. Another approach used
in mission-critical applications is the NMR (N Modular Redundancy) scheme,
which replicates the data cache multiple times. However, the NMR scheme is
too costly for microprocessors or embedded systems with cost and area constraints.

This paper proposes an approach to enhance the data reliability of ICR without
performance degradation or significant area overhead. The idea is to add a small
fully associative replica victim cache to store the replica victim lines whenever
they conflict with primary cache lines. Due to the full associativity of
the replica victim cache and data locality (replicas also exhibit temporal and
spatial locality, since they are the “images” of the primary data), a very large
percentage of load hits in the L1 can find their replicas available either in the
L1 data cache or in the replica victim cache.

The victim cache is not a new idea in cache design. Jouppi proposed the victim
cache to store the victim lines evicted from the L1 cache for reducing conflict
misses [10]. However, the victim cache proposed by Jouppi cannot be used to
enhance data reliability, since there are no redundant copies in either the L1
cache or the victim cache (i.e., only the blocks evicted from the L1 are stored
in the victim cache). While the original victim cache aims at performance
enhancement, the objective of the replica victim cache is to improve the data
integrity of ICR approaches significantly without impacting performance. The
replica victim cache is a small fully associative cache in parallel with the L1
data cache, as shown in Figure 2. In addition to In-Cache Replication, the
replica victim cache is used to store replicas for the primary data in the L1 in
the following cases:



1. There is no dead block available in the L1 data cache to store the replica.

Fig. 2. The architecture of replica victim cache.



2. The replica is replaced by the primary data since ICR gives priority to the 
primary data. 

3. The replica is replaced by another replica for another primary data (note
that for a set-associative data cache, multiple replicas can be mapped to the
same dead block with the ICR approach [7]).

Since ECC computation has performance overhead, we assume all the cache
blocks of the L1 and the replica victim cache are protected by parity check. The
replicas (both in the replica victim cache and in the L1 data cache due to ICR)
can be read at each read hit in the L1 for parallel comparison with the primary
data to detect multiple-bit errors. Alternatively, the replicas can be read only
when the parity bit of the primary data indicates an error. The former scheme
can enhance data reliability greatly, but there is a performance penalty for the
parallel comparison. We assume it takes 1 extra cycle to compare the primary
data and the replica in our simulations. The latter scheme also improves
reliability by being able to recover from single-bit errors.

The paper examines the following problems.

1. How does reliability, in terms of loads with replica (see Section 4), relate
to the size of the replica victim cache and to the size and associativity of the
primary cache? By how much can loads with replica be increased by the addition
of a replica victim cache?

2. How can the replicas in either the primary cache or the replica victim
cache be exploited to provide different levels of reliability and performance?

3. What is the error detection and correction behavior of different
replica-based schemes under different soft error rates?

To answer the above questions, we propose and evaluate the following 
schemes: 
– BaseP: This is a normal L1 data cache without the replica victim cache.
All cache blocks are protected by parity. The load and store operations are
modeled to take 1 cycle in our experiments.

– BaseECC: This scheme is similar to the BaseP scheme except that all cache
blocks are protected by ECC. Store operations still take 1 cycle (as the writes
are buffered), but loads take 2 cycles to account for the ECC verification.

– RVC-P: This scheme implements In-Cache Replication and the proposed
replica victim caching. When there is a conflict between the primary data
and the replica, the replica victim is stored in the replica victim cache. If the
replica victim cache is full, the least recently used replica is discarded. All
cache blocks are protected by parity, and the replica is only checked if the
parity bit of the primary data indicates an error. Load and store operations
are modeled to take 1 cycle in our experiments.

– RVC-C: This scheme is similar to the RVC-P scheme except that the replica
is compared with the primary data before the load returns. The search for the
replica can be executed simultaneously in both the L1 data cache and the
replica victim cache. If the replica hits in the L1 data cache, that replica is
compared with the primary data; otherwise, the replica in the replica
victim cache is used for comparison if there is a hit in the replica victim
cache. Note that we give priority to the replicas found in the L1 data cache,
because the L1 contains the most up-to-date replicas while the replica victim
cache may not (because it only contains the most up-to-date replica victims).
However, for the given primary data, if its replica cannot be found in the L1
data cache, the replica found in the replica victim cache must contain the
most up-to-date value because every replica victim of the data must have been
written to the replica victim cache. We conservatively assume that load
operations take 2 cycles, and store operations take 1 cycle as usual.
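
The replica victim cache behavior used by RVC-P and RVC-C (fully associative, least recently used replacement, L1 replicas preferred on lookup) can be sketched as follows. This is a simplified Python model under stated assumptions, not the simulator code: the class and function names are hypothetical, the L1 replicas are stood in for by a plain dictionary, and refreshing the LRU order on a lookup hit is an assumption the text does not spell out.

```python
from collections import OrderedDict


class ReplicaVictimCache:
    """Small fully associative store for replica victims with LRU replacement."""

    def __init__(self, entries: int = 4):
        self.entries = entries
        self.lines = OrderedDict()  # block address -> replica data, LRU first

    def insert(self, block_addr, data):
        """Store a replica victim, discarding the LRU replica if the cache is full."""
        if block_addr in self.lines:
            self.lines.pop(block_addr)          # refresh an existing entry
        elif len(self.lines) >= self.entries:
            self.lines.popitem(last=False)      # discard least recently used
        self.lines[block_addr] = data

    def lookup(self, block_addr):
        """Return the replica or None. ASSUMPTION: a hit refreshes LRU order."""
        if block_addr not in self.lines:
            return None
        self.lines.move_to_end(block_addr)
        return self.lines[block_addr]


def find_replica(block_addr, l1_replicas, rvc):
    """Prefer a replica held in the L1 (via ICR); otherwise fall back to the
    replica victim cache. `l1_replicas` is a hypothetical dict standing in for
    the replicas that ICR keeps in dead L1 blocks."""
    if block_addr in l1_replicas:
        return l1_replicas[block_addr]
    return rvc.lookup(block_addr)


rvc = ReplicaVictimCache(entries=4)
for addr in range(5):                 # the fifth insert evicts block 0
    rvc.insert(addr, f"replica-{addr}")
print(rvc.lookup(0), rvc.lookup(4))   # None replica-4
```

In hardware the L1 and the replica victim cache would be probed in parallel; the sequential fallback above only models which copy wins.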

It should be noted that in addition to parity check and ECC, write-through
caches and speculative ECC loads are also widely employed for improving
data reliability. For write-through caches, data redundancy is provided by
propagating every write into the L2 cache. However, write-through caches increase
the number of writes to the L2 dramatically, resulting in an increase of write
stalls even with a write buffer. Thus both performance and energy consumption
can be impacted [7]. Another way to improve data reliability while circumventing
the ECC time cost is the speculative ECC load scheme, which performs the ECC
checks in the background while data is loaded and the computation is allowed
to proceed speculatively. While speculative ECC loads can potentially hide the
access latency, it is difficult to stop the error propagation in a timely manner,
which may result in a high error recovery cost. Since ECC computation consumes
more energy than parity check, it is also shown that the speculative ECC load has
worse energy behavior than the ICR approach that uses parity check only (i.e.,
ICR-P-PS) [7]. For these reasons, we focus on investigating approaches to
improve reliability for write-back data caches, and we only compare our approach
directly with the recently proposed ICR approaches, which have exhibited better
performance and/or energy behavior than the write-through L1 data cache and
the speculative ECC load [7].

4 Evaluation Methodology 

4.1 Evaluation Metrics 

To compare performance and reliability of different schemes, we mainly use the 
following two metrics: 

– Execution Cycles is the time taken to execute 200 million application
instructions.

– Loads with Replica is the metric proposed in [7] to evaluate the reliability
enhancement for data caches. A higher loads with replica indicates higher
data reliability, as illustrated by the error injection experiments in [7]. Since
we add a replica victim cache to the conventional L1 data cache architecture,
we modify the definition of loads with replica proposed in [7] to be the
fraction of read hits that also find their replicas either in the L1 data cache
or in the replica victim cache. Note that the difference between our definition
and the definition in [7] is that in our scheme, the replica of “dirty” data can
be found either in the L1 data cache or in the replica victim cache, while in
the ICR scheme [7], the replicas can only be found in the L1 data cache.
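
As a small worked example of the modified metric, the Python sketch below computes loads with replica from three counters. The counter names are hypothetical and the numbers are purely illustrative, not results from the paper; each read hit is assumed to be counted once, under the L1 if a replica is found there and under the replica victim cache otherwise.

```python
def loads_with_replica(read_hits: int,
                       hits_with_l1_replica: int,
                       hits_with_rvc_replica: int) -> float:
    """Fraction of read hits that find a replica in the L1 data cache or,
    failing that, in the replica victim cache."""
    if read_hits == 0:
        return 0.0
    return (hits_with_l1_replica + hits_with_rvc_replica) / read_hits


# Illustrative counters only (not measured data).
icr_only = loads_with_replica(1000, 700, 0)      # ICR definition: L1 replicas only
with_rvc = loads_with_replica(1000, 700, 250)    # modified definition
print(f"{icr_only:.1%} -> {with_rvc:.1%}")       # 70.0% -> 95.0%
```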

4.2 Configuration and Benchmarks 

We have implemented the proposed replica victim caching schemes by modifying
SimpleScalar 3.0 [14]. We conduct detailed cycle-level simulations with
sim-outorder to model a multiple-issue superscalar processor with a small fully
associative replica victim cache. The default simulation values used in our
experiments are listed in Table 2 (note that we do not list the configuration of
the replica victim cache in Table 2, since we run experiments with different
replica victim cache sizes).

We select ten applications from the SPEC 2000 suite [16] for this evaluation. 
Since the simulations of these applications are extremely time consuming, we 
fast forward the first half billion instructions and present results for the next 200 
million instructions. Important performance characteristics of these applications 
in the base scheme are given in Table 3. 

5 Results 

5.1 The Size of the Replica Victim Cache and Data Integrity 
Improvement 

Our first experiment investigates the appropriate size for the replica
victim cache. On one hand, the replica victim cache must be small to minimize
Table 2. Configuration parameters in our base configuration for a superscalar
architecture. All caches are write-back.

Configuration Parameter | Value
Processor
Functional Units | 4 IALUs, 4 FPUs
LSQ Size | 8 instructions
RUU Size | 16 instructions
Fetch Width | 4 instructions/cycle
Decode Width | 4 instructions/cycle
Issue Width | 4 instructions/cycle
Commit Width | 4 instructions/cycle
Fetch Queue Size | 4 instructions
Cycle Time | 1 ns
Cache and Memory Hierarchy
L1 Instruction Cache | 16KB, 1-way, 32 byte blocks, 1 cycle latency
L1 Data Cache | 16KB, 4-way, 32 byte blocks, 1 cycle latency
L2 | 256K, 4-way, 64 byte blocks, 6 cycle latency
Memory | 100 cycle latency



Table 3. Benchmarks from SPEC2000. The last column gives the execution cycles for
the Base scheme.

Benchmark Name | Description | Number of Data References | Number of Cache Misses | Execution Cycles of Base
164.gzip | Compression | 58582206 | 1167237 | 129215053
175.vpr | FPGA circuit placement and routing | 87602536 | 1968330 | 202606221
176.gcc | C programming language compiler | 79860452 | 982483 | 220017658
181.mcf | Combinatorial optimization | 112113406 | 10423182 | 240140816
255.vortex | Object-oriented database | 110330626 | 1591461 | 231793925
256.bzip2 | Compression | 96251361 | 1550219 | 134186684
177.mesa | 3D graphics library | 98933099 | 224339 | 199721006
179.art | Image recognition/neural networks | 87569639 | 7144640 | 236659516
183.equake | Seismic wave propagation simulation | 64742897 | 373522 | 118643743
188.ammp | Computational chemistry | 118184707 | 15826817 | 354806058



the hardware overheads. On the other hand, the replica victim cache should
be large enough to accommodate as many replica victims as possible. Due
to data locality, it is possible to use a small replica victim cache to store the
replica victims for the data that is most likely to be accessed in the future. We
use an empirical approach to find the best size for the replica victim cache.
Specifically, we run experiments on two randomly selected benchmarks (i.e.,
bzip2 and equake) for replica victim caches with sizes varying from 1
block to 16 blocks, with the L1 data cache fixed to be 16K bytes, 4-way set
associative, as given in Table 2. The loads with replica results are presented in
Figure 3. As can be seen, the loads with replica increases dramatically when the
size of the replica victim cache is increased from 1 block to 2 blocks, because the
conflicts of replicas in the replica victim cache can be reduced by exploiting the
associativity. For a replica victim cache of 4 or more blocks, the loads with
replica is larger than 98.6%, which is much larger than the loads with
replica achieved by ICR schemes. We use the CACTI 3.2 model [15] to estimate the
area overhead, and the results are shown in Table 4. As can be seen, the area
overhead of the 4-entry replica victim cache is less than 10% compared to a
data cache of 16K, 32K or 64K bytes. Considering both reliability enhancement
and hardware overhead, we fix the size of the replica victim cache to be 4 entries.

Table 4. The area of a 4-entry fully associative replica victim cache, compared with
4-way associative L1 data caches of 16K, 32K and 64K bytes respectively. The cache
block size of both the replica victim cache and the data cache is 32 bytes.

           | 128B RVC | 16KB D-cache | 32KB D-cache | 64KB D-cache
area (cm²) | 0.001183 | 0.011899 | 0.021325 | 0.038569
ratio | 100% | 9.94% | 5.55% | 3.07%
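
The ratio row of Table 4 can be re-derived from the area row. The short Python check below divides the replica victim cache area by each data cache area; the variable names are ours, and the only inputs are the areas listed in the table.

```python
# Areas in cm^2 as listed in Table 4.
rvc_area = 0.001183
dcache_area = {"16KB": 0.011899, "32KB": 0.021325, "64KB": 0.038569}

# Area of the 4-entry replica victim cache relative to each L1 data cache.
ratios = {size: round(100 * rvc_area / area, 2)
          for size, area in dcache_area.items()}

for size, pct in ratios.items():
    print(f"{size} D-cache: RVC adds {pct}% area")
# 16KB D-cache: RVC adds 9.94% area
# 32KB D-cache: RVC adds 5.55% area
# 64KB D-cache: RVC adds 3.07% area
```

The overhead stays below 10% even for the smallest data cache considered, and shrinks as the data cache grows because the replica victim cache size is fixed.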

Fig. 3. Loads with replica for replica victim caches of 1, 2, 4, 8 and 16 blocks.




Figure 4 illustrates the loads with replica for a replica victim cache with 4
blocks for all 10 benchmarks. The replicas can be found either in the L1
data cache (as in the ICR approach) or in the replica victim cache. We find that
for each benchmark, the replica victim cache can store a large portion of the
replica victims that are most likely to be accessed in the future, in addition to
the replicas produced by the ICR approach, resulting in a significant enhancement
of data reliability. The average loads with replica with the replica victim cache
is 94.4%, which is 21.7% larger than with the ICR approach alone.



5.2 Performance Comparison 

Using the above settings for the replica victim cache, we next study the
performance of replica victim caching. Since the replica victim cache is only
used to store the replica victims from the L1 data cache, there is no performance
degradation compared to the corresponding ICR approaches [7]. Therefore, we only
compare the performance implications of the four schemes described in Section 3.

As shown in Figure 5, the performance of the RVC-P scheme is comparable
to the BaseP scheme and the performance of the RVC-C scheme is comparable
to the BaseECC scheme. Specifically, the average performance degradation of
RVC-P relative to BaseP and of RVC-C relative to BaseECC is 1.8% and 1.7%
respectively. It should be noted that this performance degradation comes from
ICR, not from the replica victim caching. Since ICR relies on dead block
prediction and there is no perfect dead block predictor, some cache blocks in the
L1 may be predicted

Fig. 4. Loads with replica for replica victim caches of 4 blocks.



[Figure: bar chart of execution cycles (0 to 400,000,000) per benchmark for
BaseP, BaseECC, RVC-P and RVC-C.]

Fig. 5. Performance comparison of different schemes. 




dead (and thus are utilized to store replicas) but are actually accessed later
(i.e., not “dead” yet), which can result in performance degradation.

5.3 Error Injection Results 

We conduct the error injection experiment based on a random injection model [5].
In this model, an error is injected in a random bit of a random word in the
L1 data cache. Such errors are injected at each clock cycle based on a constant
probability (called the error injection rate in this paper).
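
The random injection model described above can be sketched in Python as follows. This is an illustration rather than the injector used in the experiments: the function name, the representation of the cache as a flat list of integers, and the 32-bit word size are all assumptions.

```python
import random


def inject_errors(cache_words, cycles, rate, rng, word_bits=32):
    """Simulate `cycles` clock cycles; in each one, with probability `rate`,
    flip one random bit of one random word. Returns the number of injections.
    ASSUMPTION: the cache is modeled as a flat list of `word_bits`-bit integers."""
    injected = 0
    for _ in range(cycles):
        if rng.random() < rate:
            word = rng.randrange(len(cache_words))
            bit = rng.randrange(word_bits)
            cache_words[word] ^= 1 << bit      # single-bit upset
            injected += 1
    return injected


rng = random.Random(0)                  # fixed seed for repeatability
words = [0] * (16 * 1024 // 4)          # 16KB of data as 32-bit words
count = inject_errors(words, cycles=10_000, rate=1 / 100, rng=rng)
print(count, "errors injected")
```

Because each injection is independent, two flips can land in the same word, or even in the same byte, which is how multi-bit errors arise at high injection rates.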

Figure 6 and Figure 7 present the fraction of loads that could not be recovered
from errors (including single-bit or multi-bit errors) for BaseP, BaseECC,
ICR-P-PS, RVC-P and RVC-C for bzip2 and vpr respectively. In both experiments,
this fraction is reported as a function of the probability of an error occurring
in each cycle, with error injection rates of 1/100, 1/1000 and 1/10000.
Note that the intention here is to study data reliability under intense error
behavior, thus very high error injection rates are used. As can be seen, BaseP
always has the worst error resilience, since the parity bit can only detect
single-bit errors. When the error injection rate is relatively low (i.e., 1/10000),
BaseECC has a similar percentage of unrecoverable loads to RVC-C. However, as
the error injection rate increases, the difference between BaseECC and RVC-C
grows larger. Specifically, when the error injection rate is 1/1000, RVC-C
reduces the unrecoverable loads by 9.1% and 4.5% for bzip2 and vpr respectively,
compared to the BaseECC scheme. Similarly, the RVC-P scheme exhibits much
better error resilience than BaseP and ICR-P-PS at the different
error injection rates.




Fig. 6. Percentage of unrecoverable loads for bzip2. 



Fig. 7. Percentage of unrecoverable loads for vpr. 



5.4 Sensitivity Analysis 

To verify the effectiveness of the replica victim cache of 4 blocks for L1 data 
caches with different configurations, we also run experiments to study the loads 
with replica while varying the L1 data cache size and the associativity. 
In both experiments, the replica victim cache is fixed to be a fully-associative 
cache of 4 blocks. 

Figure 8 gives the loads with replica for L1 data caches of 8K, 16K, 32K, 
64K and 128K bytes. The block size and the associativity of the L1 data cache 
are 32 bytes and 4-way respectively. The results are interesting. As can 
be seen, bzip2 and equake exhibit different trends in loads with replica. As the 
data cache size increases, the loads with replica of bzip2 decrease slightly, while 
the loads with replica of equake increase slightly. The reason is that replicas 
can be found in two places: the L1 data cache and the replica victim cache. 
The number of replicas that can be stored in the L1 increases as the L1 data 
cache size increases; however, the relative number of replicas that can be stored 
in the replica victim cache decreases, since the size of the replica victim cache is 
fixed. Therefore, the effect of increasing the L1 data cache size on the loads with 
replica depends on these two factors. The breakdown of loads with replica 
from the L1 data cache and from the replica victim cache for bzip2 and equake 
is presented in figure 9 and figure 10 respectively. In figure 9, the decrease of 
loads with replica from the replica victim cache dominates, and thus the total 
loads with replica decrease as the L1 data cache size increases. In contrast, in 
figure 10, the increase of loads with replica from the L1 data cache dominates, 
and hence the total loads with replica increase with the L1 data 
cache size. However, for all the L1 data cache configurations, the replica victim 
cache of 4 entries achieves loads with replica of more than 98.1% on average, 
which is substantially more than the ICR approach alone can achieve. 



Fig. 8. Loads with replica for 4-way associative L1 data caches of 8K, 16K, 32K, 64K 
and 128K bytes. The replica victim cache is fully associative with 4 blocks. The block 
size of both the L1 data cache and the replica victim cache is 32 bytes. Regardless of 
the L1 data cache size, the addition of the fully associative replica victim cache of 4 
blocks achieves loads with replica of more than 98.1% on average. 



Fig. 9. Loads with replica breakdown for L1 data caches of 8K, 16K, 32K, 64K and 
128K for bzip2. 



[Legend for Figs. 9 and 10: ICR, RVC.]



Fig. 10. Loads with replica breakdown for L1 data caches of 8K, 16K, 32K, 64K and 
128K for equake. 



We also study the loads with replica for L1 data caches with different 
associativity and find similar results. Therefore, with the addition of a small 
fully-associative replica victim cache of 4 entries, a very high fraction of loads 
with replica can be achieved, further enhancing the data reliability of ICR for a 
variety of L1 data caches. 



6 Conclusion 

This paper studies the limitation of In-Cache Replication and proposes to add 
a small fully-associative replica victim cache to store the replica victim lines in 
case of their conflicts with the primary data. We find that with the addition of a 
small replica victim cache of 4 entries, the loads with replica of the ICR scheme 
can be increased by 21.7%. On average, 94.4% of load hits in L1 can find replicas 
either in the L1 data cache or in the replica victim cache. We also propose and 
evaluate two different reliability-enhancing schemes, RVC-P and RVC-C, 
that prove to be quite useful. 

RVC-P is a much better alternative to ICR-P-PS where simple parity protection 
is wanted. It can enhance reliability significantly by providing additional 
replicas in the replica victim cache without compromising performance. RVC-P 
also has better performance than RVC-C or ECC-based schemes (i.e., BaseECC). 

RVC-C can increase the error detection/correction capability by comparing 
the primary data and the replica before the load returns. Our error injection 
experiments reveal that RVC-C has the best reliability and can be used for 
applications that demand very high reliability or operate under highly noisy en- 
vironments. Compared with the BaseECC scheme, the performance degradation 
of RVC-C is only 1.7% on average. 

In summary, this paper proposes the addition of a small fully-associative 
replica victim cache to enhance the data reliability of ICR schemes significantly 
without compromising performance. Our future work will concentrate on studying 
the reliability and performance impact of the replica victim cache for 
multiprogramming workloads. In addition, we plan to investigate how to use a unified 
victim cache efficiently to store both primary victims and replica victims, and 
the possibility of making a better tradeoff between performance and reliability. 



Abstract. In this paper we present a victim cache design for caches organized 
with lines that contain multiple sectors (sector caches). Sector caches use fewer 
memory bits to store tags than non-sectored caches. The victim cache has been 
proposed to alleviate conflict misses in low-associativity cache designs. This 
paper examines how a victim cache can be implemented in a sector cache and 
proposes a further optimization of the victim buffer design, in which only the tags 
of the victim lines are remembered in order to re-use data still in the sector cache. 
This design is more efficient because only an additional “OR” operation is needed in 
the tag checking critical path. We use a full system simulator to generate traces 
and a cache simulator to compare the miss ratios of different victim cache 
designs in sector caches. Simulation results show that the proposed design has 
miss ratios comparable to those of much more complex designs. 



1 Introduction 

In a cache, an address tag (or tag) is used to identify the memory unit stored in the 
cache. The size of this memory unit affects how well a cache functions. For a 
fixed-size cache, a larger unit size needs fewer memory bits to store tags and helps 
programs that possess spatial locality. However, a larger unit may cause 
fragmentation, making the cache less efficient when spatial locality is absent. 
Moreover, transferring each unit from the lower memory hierarchy takes more 
bandwidth. A smaller unit size allows more units to be included and may help 
programs whose memory usage is spread out. 

Sector caches [1][2] have been proposed as an alternative that strikes a balance 
between cache unit sizes. A sector cache’s memory unit is divided into sub-sections. 
Each unit needs only one tag, which saves tag memory bits. The sub-sections of a 
sector cache need not be brought into the cache simultaneously, which lowers the 
transfer bandwidth. Another advantage of sector caches is observed in multiprocessor 
systems, because they reduce false sharing [3][4]. The advantage of sector caches is 
evident in that many microprocessors employ them in their designs. For example, 
Intel’s Pentium® 4¹ [5], SUN’s SPARC™ [6] and IBM’s POWERPC™ G4™ [7]/G5™ [8] all 
employ sector caches in their cache organization. 

This work proposes and evaluates further optimization techniques to improve 
the performance of a sector cache. One of those designs is the victim cache [9]. A 
victim cache includes an additional victim buffer. When a line is replaced, it is put 
into this small, fully associative buffer instead of just being discarded. The 
idea is to share the victim buffer entries among all sets, since usually only a few 
of them are hotly contended. First, we discuss how the victim buffer/cache idea can 
be applied in a sector cache. We evaluate two implementations of the victim cache. 
One is called “line-victim” and the other “sector-victim”. We further propose a third 
victim mechanism design, named the “victim sector tag buffer” (VST buffer), to 
further utilize the sector cache lines. This design tries to address a sector cache’s 
potential disadvantage of having a larger unit size that could be under-utilized. 

Since many different names [3][6][10][11][12][13][14][15] are used to describe 
the units used in a sector cache, we first describe the terminology used in this 
paper. In our terminology, a cache consists of lines, each of which has a tag 
associated with it. Each line consists of sub-units called sectors. This naming 
convention is the same as that in the manuals of the Pentium® 4 [5] and 
POWERPC™ [7][8]. An example 4-way set-associative cache set is shown in figure 1. 
A valid bit is added to every sector to identify a partially valid cache line. We also 
use the term s-ratio, which is defined as the ratio between the line size and the 
sector size. A sector cache with s-ratio equal to p is called a p-sectored cache, as 
in [1]. The example in figure 1 is a 4-sectored cache. 



[Figure 1: one set of a 4-way set-associative, 4-sectored cache. Each line has one 
address tag, and each sector has its own valid bit (V).]

Line    Address tag  Cache data
Line 1  A            V Sector 0 (to A+SL)  V Sector 1 (to A+2*SL)  V Sector 2 (to A+3*SL)  V Sector 3 (to A+4*SL)
Line 2  B            V Sector 0 (to B+SL)  V Sector 1 (to B+2*SL)  V Sector 2 (to B+3*SL)  V Sector 3 (to B+4*SL)
Line 3  C            V Sector 0 (to C+SL)  V Sector 1 (to C+2*SL)  V Sector 2 (to C+3*SL)  V Sector 3 (to C+4*SL)
Line 4  D            V Sector 0 (to D+SL)  V Sector 1 (to D+2*SL)  V Sector 2 (to D+3*SL)  V Sector 3 (to D+4*SL)

Fig. 1. Principles of sectored cache 

This paper is organized as follows. In this section we introduce the concepts of the 
sector cache and the victim mechanism. In the next section we first review related 
work in this area and then describe our design in more detail. In section three we 
present the simulation methodology. In sections four and five we present our 
simulation results for different cache levels. Finally we conclude with some 
observations. 



¹ Pentium is a registered trademark of Intel Corp. or its subsidiaries in the United 
States and other countries. 



2 Sector Cache with Victim Buffer 

2.1 Related Work 

Sector caches can be used to reduce bus traffic with only a small increase in miss 
ratio [15]. Sector caches are beneficial in two-level cache systems in which the tags 
of the second-level cache are placed at the first level, thus permitting a small tag 
store to control a large cache. A sector cache can also improve single-level cache 
system performance in some cases, particularly if the cache is small, the line size is 
small or the miss penalty is small. The main drawback, cache space underutilization, 
is also shown in [13]. 

Rothman proposes the “sector pool” to address cache space underutilization [13]. In 
this design, each set of a set-associative cache is composed of s-ratio sector lists 
in total. Each list has a fixed number of sectors, which is less than the 
associativity. S-ratio additional pointer bits, associated with a line tag, point to the 
actual sector as an index into the sector list. Thus a physical sector can be shared 
among different cache lines to make the cache space more efficient. Unlike our victim 
mechanism, which tries to reduce the cache miss ratio, this design focuses more on 
cache space reduction. It depends on a highly set-associative cache. The additional 
pointer bits and the sector lists make the control more complex. For example, the 
output of the tag comparison must first be used to obtain the corresponding pointer 
bits, and only then can the resulting sector be fetched. This lengthens the critical 
path. Another example is that different replacement algorithms for the cache lines 
and the sector lists need to be employed at the same time. 

Seznec proposes the “decoupled sectored cache” [1][11]. An [N,P] decoupled sectored 
cache means that in this P-sectored cache there exists a number N such that for any 
cache sector, the address tag associated with it is dynamically chosen among N 
possible address tags. Thus a log2N-bit tag, known as the selection tag, is associated 
with each cache sector in order to allow it to retrieve its address tag. This design 
improves cache performance by allowing N memory lines to share a cache line when 
they use different sectors, whereas some of those sectors would have to be invalid in 
a normal sector cache design. Our concern about this design is the additional tag 
storage: N-1 address tags and s-ratio * log2N selection tag bits for each line require 
a large amount of extra storage. Seznec himself uses a large s-ratio (32 or 64) to 
reduce tag storage before decoupling, which makes the validity check and coherence 
control very complex. We tried to implement this idea and found that the line fill-in 
and line replacement policies are important for performance. Since Seznec did not 
give enough details of his policies, we used a line-based, LRU-like fill-in/replacement 
policy of our own; with it, the decoupled sector cache does not perform better than 
our VST design with similar extra storage for s-ratios in the range 2-8. In addition, 
an extra comparison must be performed on the retrieved selection tag to ensure that 
the sector data really corresponds to the address tag that caused the tag match. This 
also lengthens the tag checking critical path. 

Victim caching was proposed by Jouppi [9] as an approach to reduce the cache 
miss ratio in a low-associativity cache. This approach augments the main cache with a 
small fully-associative victim cache that stores cache blocks evicted from the main 
cache as a result of replacements. The victim cache is effective: depending on the 
program, a four-entry victim cache might remove one quarter of the misses in a 4-KB 
direct-mapped data cache. More recently, [16] shows that the victim cache is the most 
power/energy-efficient design among the general cache miss reduction mechanisms. 
It thus becomes a more attractive design given the increasing demand for low-power 
micro-architectures. 
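
To illustrate the victim caching idea, the following minimal sketch models a direct-mapped cache backed by a small fully-associative victim buffer. It is an illustration under stated assumptions rather than Jouppi's exact design: the class name, the LRU replacement of victim entries and the swap on a victim hit are choices made for this example.

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    """Direct-mapped cache plus a small fully-associative LRU victim buffer."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets      # block address held by each set
        self.victim = OrderedDict()         # evicted block addresses, LRU first
        self.victim_entries = victim_entries

    def access(self, block_addr):
        idx = block_addr % self.num_sets
        if self.lines[idx] == block_addr:
            return "hit"
        evicted = self.lines[idx]
        self.lines[idx] = block_addr        # the requested block moves into the main cache
        if block_addr in self.victim:
            del self.victim[block_addr]     # found in the victim buffer: swap it back
            result = "victim_hit"
        else:
            result = "miss"
        if evicted is not None:
            self.victim[evicted] = True     # the replaced block becomes a victim
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)   # drop the least recently inserted
        return result
```

Two blocks that conflict in the same set (for example blocks 0 and 4 with 4 sets) then ping-pong between the main cache and the victim buffer instead of missing every time.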



2.2 Proposed Design 

In order to make our description clearer, we define several terms here. We call a 
reference to a sector cache a block-miss if the reference finds no matching tag within 
the selected set. We call a reference a data-miss if the tag matches but the referenced 
word is in an invalid sector. Thus a miss can be either a block-miss or a data-miss. 
Similar to the description in [1][11], for a P-sectored cache we divide the address A 
of a memory block into four sub-strings (A3, A2, A1, A0), defined as follows: A0 is a 
log2SL-bit string, where SL is the sector length; A1 is a log2P-bit string that shows 
which sector this block occupies if it is in a cache line; A2 is a log2nS-bit string, 
where nS is the number of cache sets; and A3 consists of the remaining most 
significant bits. The main cache line needs to store only the bits in A3. Figure 2 
shows tag checking in the direct-mapped case. A2 identifies the only position of the 
tag to be compared in the tag array. (A2, A1) identifies the only position where the 
data sector can be. The data can be fetched without any dependency on the tag 
comparison. The processor pipeline may even start consuming the data speculatively 
without waiting for the tag comparison result, rolling back and restarting with the 
correct data only in the rare case of a cache miss, as in line prediction. 
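
The address decomposition and the two kinds of miss defined above can be made concrete with a short sketch. The function names and the representation of a cache set as a list of (tag, valid bits) pairs are assumptions made for this example, and SL, P and the number of sets are assumed to be powers of two.

```python
def split_address(addr, sector_len, p, num_sets):
    """Split an address into (A3, A2, A1, A0) for a P-sectored cache.

    sector_len (SL), p (P) and num_sets (nS) must be powers of two.
    """
    a0_bits = sector_len.bit_length() - 1   # log2(SL)
    a1_bits = p.bit_length() - 1            # log2(P)
    a2_bits = num_sets.bit_length() - 1     # log2(nS)
    a0 = addr & (sector_len - 1)                          # offset within the sector
    a1 = (addr >> a0_bits) & (p - 1)                      # sector index within the line
    a2 = (addr >> (a0_bits + a1_bits)) & (num_sets - 1)   # set index
    a3 = addr >> (a0_bits + a1_bits + a2_bits)            # tag stored with the line
    return a3, a2, a1, a0

def classify(cache_set, a3, a1):
    """Classify a reference to one set given as a list of (tag, valid_bits)."""
    for tag, valid in cache_set:
        if tag == a3:
            return "hit" if valid[a1] else "data-miss"
    return "block-miss"
```

For instance, with 16-byte sectors, P = 4 and 64 sets, `split_address(0x12345, 16, 4, 64)` yields A3 = 18, A2 = 13, A1 = 0 and A0 = 5.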




Fig. 2. Direct-mapped sector cache tag checking 

In the case of a set-associative cache, (A2, A1) can only select the conceptual 
“sector set”; the cache must then wait for the result of the address tag comparison 
to obtain a line ID and deliver the corresponding sector. Figure 3 shows such an 
example for a 2-way associative sector cache. In a low-associativity cache, the 
selected sector data and valid bits can be fetched independently of the tag 
comparison. 




Fig. 3. 2-way associative sector cache tag checking 



In figure 3 a line ID is needed, in the critical path, as the selection signal of the 
MUX. The line ID is obtained from the tag comparison result. For a more highly 
associative cache such as a CAM, where a simple MUX may not be usable, the data and 
valid bits may not be immediately at hand. But line ID retrieval still dominates the 
critical path there [17]. 

As mentioned by many researchers, a victim cache can reduce the cache miss ratio. 
There are two straightforward victim designs for a sector cache. One is the 
line-victim cache (LVC), the other is the sector-victim cache (SVC). Figure 4 shows 
their tag checking. Tag checking of the line-victim cache is on the left and that of 
the sector-victim cache is on the right. The main difference between them is the data 
unit associated with the victim tag. In the line-victim cache, the data unit is a cache 
line; in the sector-victim cache, the data unit is a sector. Thus the lengths of the 
victim tags of LVC and SVC are different. For the same number of entries, LVC can be 
expected to save more cache misses because it has more storage, whereas SVC can be 
expected to have slightly faster tag checking and data retrieval. Figure 4 does not 
connect the victim cache with the main cache, to avoid unnecessary complexity and to 
let architects decide whether to swap the victim data with the main cache data on a 
victim cache hit. 

Both the line-victim cache and the sector-victim cache are accessed in parallel with 
the main sector cache. A cache line is evicted to the line-victim cache when a cache 
replacement happens. For the sector-victim cache, only the valid sectors of the 
whole line are evicted to the sector-victim cache. Also, when a new line is brought 
into the cache, the sector-victim cache is checked to see whether it holds other 
sectors of the same line. If so, those victim sectors are also brought into the main 
cache line, so that a cache line resides in a unique position. 

Some of the requested data may still be in the data cache but be inaccessible 
because it has been invalidated. This paper describes another approach, 
called the “VST buffer”, to remember what is still in the data cache. 




Fig. 4. Tag checking of line-victim cache and sector-victim cache 



When a block miss happens and the set is full, a cache line must be replaced. Each 
sector of the replaced line is marked invalid. A new sector is brought into 
the replaced line, and the cache tag is updated. Some of the replaced line’s sector 
data may therefore still be in the data array, since not all sectors of the newly 
brought-in cache line are fetched; only their valid bits are marked invalid. 
The VST buffer keeps track of these sectors whose data is still in the data array. 
A VST entry thus consists of the victim tag, the victim valid bits and the “real 
location” line ID in the cache set. For a direct-mapped main cache the line ID field is 
unnecessary. The left side of figure 5 shows the VST buffer tag checking with a 
direct-mapped main cache. As seen in the figure, the VST buffer produces an 
additional “VST hit” signal that is ORed with the main cache hit signal in the 
critical path, without affecting sector fetching and consumption. On either a VST hit 
or a main cache hit, the data can be processed without interruption. 

The right side of figure 5 shows the VST buffer tag checking with a set-associative 
main cache. A VST hit not only yields a hit result but also delivers a line ID to the 
main cache selector to obtain the resulting data. This line ID signal is ORed with the 
main cache line ID signal to select the final line. One extra cost is that when a 
data miss brings a new sector into the main cache, the VST must be checked to see 
whether that position contains a victimized sector. If so, the victim entry is 
invalidated. This does not increase cache-hit latency, since it happens only on a 
cache miss. Since a cache miss penalty is already incurred, the additional cost seems 
acceptable. 
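
The behavior of the VST buffer can be sketched with a small read-only model for the direct-mapped case. This is an illustration under several assumptions that the text does not fix: the class and method names are hypothetical, VST entries are replaced in LRU order, a newly fetched sector clears the matching bit of every victim entry for that set (an entry disappears once no bit remains), and data contents and writes are not modeled.

```python
from collections import OrderedDict

class VSTSectorCache:
    """Direct-mapped sector cache with a victim sector tag (VST) buffer."""

    def __init__(self, num_sets, p, vst_entries):
        self.p = p
        self.tags = [None] * num_sets
        self.valid = [set() for _ in range(num_sets)]   # valid sector indices per line
        self.vst = OrderedDict()   # (set index, victim tag) -> sectors still in the data array
        self.vst_entries = vst_entries

    def _fill(self, s, sector):
        # A fetched sector overwrites its slot, so no victim may claim that slot any more.
        for key in [k for k in self.vst if k[0] == s]:
            self.vst[key].discard(sector)
            if not self.vst[key]:
                del self.vst[key]
        self.valid[s].add(sector)

    def access(self, tag, s, sector):
        if self.tags[s] == tag and sector in self.valid[s]:
            return "hit"
        key = (s, tag)
        if key in self.vst and sector in self.vst[key]:
            self.vst.move_to_end(key)        # "VST hit" ORed with the main cache hit
            return "vst_hit"
        if self.tags[s] != tag:              # block miss: the current line becomes a victim
            if self.tags[s] is not None and self.valid[s]:
                old = (s, self.tags[s])
                self.vst[old] = self.vst.pop(old, set()) | self.valid[s]
                while len(self.vst) > self.vst_entries:
                    self.vst.popitem(last=False)
            self.tags[s] = tag
            self.valid[s] = set()
            result = "block_miss"
        else:
            result = "data_miss"
        self._fill(s, sector)
        return result
```

In this model a line that is replaced after only one of its slots has been overwritten can still serve its remaining sectors through the VST buffer, without any extra data storage.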

Fig. 5. Tag checking of Victim Sector Tag Buffer (VST buffer) with sector cache 



We now compare the cost of the three victim mechanisms attached to a P-sectored 
cache, each composed of N entries. Besides similar comparators and control 
logic, the line-victim cache needs N line tags, N * line size of data and N * P 
valid bits; the sector-victim cache needs N sector tags (each log2P bits longer 
than a line tag), N * sector size of data and N valid bits; and the VST buffer 
needs N line tags, N * P valid bits and N * log2Assoc bits of line IDs, where 
Assoc is the cache associativity. The line-victim cache therefore needs the most 
resources of the three, and the VST buffer the least. 
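
The cost comparison above can be turned into a small calculation. The function below is a sketch that simply adds up the listed terms; its name and the example line tag width of 26 bits are assumptions, and comparators and control logic are left out.

```python
def victim_storage_bits(n, line_tag_bits, line_bytes, p, assoc):
    """Storage in bits of N-entry LVC, SVC and VST buffers for a P-sectored cache.

    p and assoc are assumed to be powers of two.
    """
    log2_p = p.bit_length() - 1
    log2_assoc = assoc.bit_length() - 1
    sector_bytes = line_bytes // p
    lvc = n * (line_tag_bits + 8 * line_bytes + p)             # tag + line data + P valid bits
    svc = n * (line_tag_bits + log2_p + 8 * sector_bytes + 1)  # longer tag + sector data + 1 valid bit
    vst = n * (line_tag_bits + p + log2_assoc)                 # tag + P valid bits + line ID
    return {"LVC": lvc, "SVC": svc, "VST": vst}
```

For example, 8 entries with 26-bit line tags, 64-byte lines, P = 4 and a 4-way main cache give 4336 bits for LVC, 1256 bits for SVC and 256 bits for VST.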

In MP systems, where the sector cache has proved efficient, an additional 
cache coherence protocol, such as MESI, is needed to maintain cache coherence. We 
expect the victim mechanism to make the MP sector cache coherence protocol more 
complex, but we do not discuss the details here since they are beyond the scope of 
this paper. 



3 Simulation Methodology 

Several SpecCPU2K [18] benchmarks (compiled with the compiler options “-O3 -Qipo”), 
the Java workload SpecJBB2K [19], an integer benchmark run with the Java runtime 
environment JSEV1.4.1, and two commercial-like floating-point benchmarks, a 
speech recognition engine [20] and an echo cancellation algorithm [21] from 
audio signal processing, are used in our study. 

In order to consider all the effects, including system calls, we use a full system 
simulator to generate memory reference traces. The simulator used is called 
SoftSDV [22]. The host system runs Windows 2000 and the simulated system is 
Windows NT in batch mode, using solid input captured in files. We then run the 
traces through a trace-driven cache simulator. 

We generate both L1 memory reference traces and L2 memory reference traces. 
Starting 20 billion instructions after system start-up (the target application is 
configured to auto-run in the simulation), we collect 200 million memory references 
as our L1 traces. We use the first 100 million references to warm up the L1 cache and 
analyze the behavior of the latter 100 million. The L1 sector cache we simulated is 
mainly configured as follows, with small variations: 16KB size, 64B line size, 16B 
sector size, 4-way associative, LRU replacement algorithm and write-back approach. 
For L2 cache behavior we use a built-in first-level cache together with trace 
generation. We warm up the built-in cache with 1 billion instructions. Then we 
collect L2 traces consisting of 200 million read references. In our simulation we 
also use 50 million L2 references to warm up the L2 cache. The hierarchy containing 
the L2 sector cache that we simulated is mainly configured as follows, with small 
variations. L1: 16KB size, 32B line size, 4-way associative, LRU replacement 
algorithm and write-back approach. L2: 1MB size, 128 byte line size, 32 byte sector 
size, 8-way associative, LRU replacement algorithm and write-back approach. 
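
For reference, the geometry implied by these two sector cache configurations can be derived with a short calculation. The helper name `cache_geometry` is an assumption made for this sketch.

```python
def cache_geometry(size_bytes, line_bytes, sector_bytes, assoc):
    """Derive the number of sets, lines and sectors and the s-ratio of a sector cache."""
    num_sets = size_bytes // (line_bytes * assoc)
    s_ratio = line_bytes // sector_bytes          # line size divided by sector size
    lines = num_sets * assoc
    return {"sets": num_sets, "s_ratio": s_ratio,
            "lines": lines, "sectors": lines * s_ratio}

# L1 sector cache: 16KB, 64B lines, 16B sectors, 4-way.
l1 = cache_geometry(16 * 1024, 64, 16, 4)
# L2 sector cache: 1MB, 128B lines, 32B sectors, 8-way.
l2 = cache_geometry(1024 * 1024, 128, 32, 8)
```

Both configurations are 4-sectored: the L1 sector cache has 64 sets and the L2 sector cache has 1024 sets.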



4 Level 1 Sector-Cache Simulation Results and Discussion 

We present the L1 simulation data as the Miss Ratio Improvement Percentage (MRIP) 
of all benchmarks. We present the L1 data first because it is easier to correlate 
the observed L1 behavior back with the source code. Figures 6 and 7 show the MRIP 
trends with various parameters as the variable. All the numbers are computed as the 
geometric means of the data of the different workloads, which are also listed in the 
paper. Figure 6 indicates that the miss ratio improvement increases with the number 
of victim mechanism entries. Since VST requires no data array, we can implement a 
much larger victim buffer at the same cost as a smaller SVC/LVC and achieve the same 
(or even better) performance improvement. For example, a 128-entry VST performs 
comparably with a 64-entry SVC or a 32-entry LVC. Figure 6 also explores the 
improvement for several sector cache line sizes and sector sizes. We observed that 
VST performs better with larger s-ratios, because more of the cache space is 
underutilized at higher s-ratios. On the other hand, SVC and LVC perform better with 
larger line and sector sizes. Figure 7 compares how the victim mechanisms affect 
caches with different associativities or different cache sizes. It is not surprising 
that all three forms of victim mechanism help lower-associativity caches more, 
because higher associativity already removes many of the conflict misses that a 
victim cache targets. We also see that smaller L1 caches benefit more from the 
victim mechanisms. As the frequency of microprocessors continues to grow, smaller but 
faster caches (lower associativity also gives a faster cache) will become more 
prevalent. 
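
The metric can be sketched in a few lines. The text does not spell out the MRIP formula, so the definition below (relative miss ratio reduction with respect to the sector cache without a victim mechanism, in percent) is an assumption of this sketch, as are the function names.

```python
import math

def mrip(base_miss_ratio, victim_miss_ratio):
    """Miss ratio improvement in percent (assumed definition, see above)."""
    return 100.0 * (base_miss_ratio - victim_miss_ratio) / base_miss_ratio

def geometric_mean(values):
    """Geometric mean of positive per-workload values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

Under this definition, a miss ratio that drops from 0.10 to 0.08 gives an MRIP of 20%. Note that a geometric mean is only defined when every per-workload value is positive.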




Fig. 6. MRIP with victim entries or line/sector sizes (higher is better) 

We observe that LVC gives the best miss ratio improvement at the highest 
hardware cost. Although the SVC approach we used for this study has the second 
highest hardware cost, it is not better than the VST approach. The VST approach is a 
reasonable choice in terms of hardware design complexity and overhead. 




[Figure 7. Left panel x-axis: associativity (DM, 2-way, 4-way, 8-way, 16-way). 
Right panel x-axis: L1 size (8, 16, 32, 64, 128, 256).]



Fig. 7. MRIP with different associativities or L1 sizes 



The cache miss ratios with different numbers of victim entries, corresponding to the 
left panel of figure 6, are listed in table 1. The data of the other figures are 
listed in the appendix. Table 1 also lists the corresponding block miss ratios for 
further investigation. 



Table 1. Miss Ratios and Block Miss Ratios with numbers of victim entries 

As shown in table 1, the benchmark “mesa” gets the largest cache miss reduction 
from the victim mechanism, regardless of whether LVC, SVC or VST is used; “ammp” 
gets the smallest miss reduction with LVC, and “saec” gets the smallest miss 
reduction with SVC and VST. 

For the workload “mesa”, we observed that the victim mechanism reduces the block 
miss ratio much more significantly than the cache miss ratio. Thus, with the 
victim mechanism, the workload essentially keeps more cache lines to save cache 
misses at this level. Other factors, such as the quantitative spatial locality that 
makes SVC perform differently (that is, achieve a different percentage of the miss 
ratio reduction that LVC achieves with the same number of entries), play a minor 
role at this level. 

In some cases (GCC with 8 victim entries), the VST buffer approach performs 
better than LVC even without any data array. After investigation we concluded that the 
VST buffer approach sometimes uses the victim buffer more efficiently and can avoid 
being thrashed. A victim cache contains data that may be used in the future, but the 
data can also be kicked out of the victim cache before it is needed. For example, 
streaming accesses that miss the main cache evict main cache lines into the victim 
cache. The victim cache thus gets thrashed and may lose useful information. The 
VST approach behaves differently. In a non-sector cache, streaming accesses map to 
different cache sets, which makes them difficult to detect. In a sector cache, the 
next sector of a cache line is inherently the successor of the previous sector. 
Figure 8 shows the VST states as sequential (streaming) accesses arrive at the cache 
one by one: a single VST entry is enough to handle them, since the entry can 
be re-used (disabled) after a whole main cache line has been filled in. The whole 
buffer thus keeps a longer history. This is exactly the case in which VST performs 
better than LVC for GCC. 



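[Figure 8: VST entry states under the sequential accesses Ld A, Ld A + sbsize, 
Ld A + sbsize * 2, ...; once the whole main cache line is filled, new content is 
filled into the entry.]



Fig. 8. Avoiding thrashing by streaming accesses 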



5 Level 2 Sector-Cache Simulation Results and Discussion 

We also explore the possibility of applying our proposed methods to level-two cache 
design. This time only those references that missed the built-in level-one cache are 
collected in the trace file. Table 2 tabulates the results in terms of miss ratio 
for various numbers of entries. Data for other parameters are listed in the appendix. 

There are several observations to be made from the L2 data. First, LVC performs 
better than SVC with the same number of entries, but worse than SVC with s-ratio 
times (here 4 times) as many entries, as was also observed in the L1 data. Second, in 
a lower-level set-associative cache, the victim mechanism behaves differently than 
in L1: it does not save as many cache misses as in the L1 cache. This is not 
surprising, since a small L1 already captures a significant part of the data locality 
and L2 reference patterns tend to be more irregular. Third, the VST buffer performs 
well among the three victim mechanisms at this memory hierarchy level. It can 
outperform LVC and SVC for the benchmark “ammp”. Although it is more difficult to 
correlate the L2 references back with the source or binary than the L1 references, 
we still ascribe the better VST performance to its property of avoiding being 
thrashed. As to the workloads, “ammp” and “SAEC” get the most significant cache 
miss reduction here. This behavior is opposite to the L1 behavior. Also, no 
significant block miss reduction can be observed at this level, as the data in the 
appendix shows. We therefore suggest that the extra storage of LVC and SVC benefits 
more from general data locality, whereas VST benefits more from cache 
underutilization, whether the reference pattern is regular or not. 

Table 2. L2 Miss Ratio with Victim Mechanism 



6 Conclusion 

We have described three possible implementations of the victim buffer design in a 
sector cache. They differ in complexity and hardware overhead. Several up-to-date 
applications are used to evaluate their performance in terms of miss ratio. Overall, 
the three mechanisms achieve comparable cache miss reduction. For a direct-mapped 
level 1 cache, the mechanisms can save a significant number of cache misses. 

Among the three mechanisms, LVC gives the best performance with the highest 
overhead. Whether SVC is performance/cost effective or not depends on the 
quantitative spatial locality of the workload. 

We also investigate several benefits of VST in this paper. These include its low-cost 
design, its ability to keep a longer victim history, and its better ability to capture 
irregular reference patterns in the lower memory hierarchy. 



Abstract. Recent research results show that conventional hardware-only cache 
replacement policies result in unsatisfactory cache utilization because of cache 
pollution. To overcome this problem, cache hints are introduced to assist cache 
replacement. Cache hints are used to specify the cache level at which the data is 
stored after accessing it. This paper presents a compiler-assisted cache replacement 
policy, Optimum Cache Partition (OCP), which can be carried out 
through cache hints and the LRU replacement policy. Presburger arithmetic is used 
to exactly model the behavior of loop nests under the OCP policy. The OCP 
replacement policy results in plain cache behavior, and makes analyzing and 
optimizing cache misses easy and efficient. The OCP replacement policy has been 
implemented in our compiler test-bed and evaluated on a set of scientific computing 
benchmarks. Initial results show that our approach is effective in reducing 
the cache miss rate. 



1 Introduction 

Caches play a very important role in the performance of modern computer systems
because of the gap between memory and processor speed, but they are only effective
when programs exhibit data locality. In the past, many compiler optimizations
have been proposed to enhance data locality. However, a conventional cache is
typically designed in a hardware-only fashion, where data management, including
cache line replacement, is decided purely by hardware. A consequence of this
design approach is that the cache can make poor decisions in choosing the data to
be replaced, which may lead to poor cache performance. Research results [1] reveal
that a considerable fraction of cache lines are held by data that will not be
reused before being displaced from the cache. This phenomenon, called cache
pollution, severely degrades cache performance.
A number of architectural efforts address this problem; the cache hint in EPIC
(Explicitly Parallel Instruction Computing) architectures [2,3] is one of them.
Cache hints specify the cache level at which data is stored after it is accessed.
Intuitively, two kinds of memory instructions should be given cache hints [12]:
i) those whose referenced data does not exhibit reuse; ii) those whose referenced
data does exhibit reuse, but the reuse cannot be realized under the particular
cache configuration. The problem may seem simple for regular applications, with
existing techniques for analyzing data reuse [4] and estimating cache misses [5, 6]
sufficing to solve it. This plausible statement, however, is not true, because a
fundamental technique used in cache miss estimation, footprint analysis, is based
on the assumption that all accessed data compete for cache space equally. In EPIC
architectures, however, memory instructions are not homogeneous: those with cache
hints have much less demand for cache space. This makes approaches derived from
traditional footprint analysis very conservative. In summary, the following cyclic
dependence [12] exists: cache hint assignment requires accurate cache miss
estimation, while accurate cache miss estimation is only possible once cache hint
assignment is finalized.

In this paper, we present a novel cache replacement policy, Optimum Cache
Partition (OCP), to address the above problem. The OCP cache replacement policy
can be carried out by combining the hardware LRU (Least Recently Used) cache
replacement policy with compiler-generated cache hints. Presburger arithmetic [15]
is used to model exactly the behavior of loop nests under the OCP policy. Because
the OCP replacement policy fundamentally simplifies cache behavior, it also makes
cache miss estimation simpler.

We have evaluated the benefit of the OCP cache replacement policy in reducing
the cache miss rate by executing a set of SPEC benchmarks on Trimaran [7] and
DineroIV [8]. Trimaran is a compiler infrastructure supporting state-of-the-art
research in EPIC architectures, and DineroIV is a trace-driven cache simulator.
Initial experimental results show that our approach reduces data cache misses by
20.32% on average.

The rest of this paper is organized as follows. Section 2 briefly reviews the basic
concept of cache hints. Section 3 illustrates, through an example, a
compiler-assisted cache replacement policy, Cache Partition. Section 4 uses
Presburger arithmetic to analyze the relationship between reuse vectors and cache
miss rate and to estimate cache hits and misses, and then derives an optimum cache
replacement policy, Optimum Cache Partition. Our implementation and experimental
results are presented in Section 5. Section 6 discusses related work. Section 7
concludes the paper.



2 Cache Hints 

One of the basic principles of the EPIC philosophy is to let the compiler decide
when to issue operations and which resources to use. The existing EPIC
architectures (HPL-PD [2] and IA-64 [3]) communicate the compiler's decisions about
cache hierarchy management to the processor through cache hints. The semantics of
cache hints on both architectures are similar. Cache hints specify the cache level
at which the data is likely to be found, as well as the cache level at which the
data is stored after it is accessed.




[Figure 1: two panels, each showing the L1 and L2 caches before and after
execution. (a) Cache hit: the hint makes a data line in the cache the replacement
candidate. (b) Cache miss: the hint makes a data line pass through the cache.]

Fig. 1. Example of the effect of the cache hints in the load instruction ld.nt1

In the IA-64 architecture, the cache hints t1, nt1, nt2 and nta are defined. t1
means that the memory instruction has temporal locality in all cache levels; nt1
means that it has temporal locality only in the L2 cache and below. Similarly, nt2
indicates that there is temporal locality only in L3, and nta that there is no
temporal locality at all. An example is given in Figure 1, which shows the effect
of the load instruction ld.nt1.

By generating cache hints based on the data locality of each instruction, the
compiler can improve a program's cache behavior. As shown in Figure 1, cache hints
have two effects on cache behavior: i) on a cache hit, the hint makes the currently
referenced line, which resides in the cache, the replacement candidate; ii) on a
cache miss, the hint makes the currently referenced line, which is not in the
cache, pass through the cache without being stored in it. Examples of these effects
are given in Figures 1a and 1b.
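
The two effects of a hinted access can be sketched with a small simulator. This is only an illustration under stated assumptions: the cache is modeled as a single fully associative LRU level, and the function name `count_misses` and the access stream are hypothetical, not taken from the paper or from any real simulator.

```python
from collections import OrderedDict

def count_misses(accesses, num_lines):
    """Count misses in a fully associative LRU cache with optional hints.

    accesses is a sequence of (line, hinted) pairs. A hinted access follows
    the two effects described above: on a hit the line becomes the next
    replacement candidate, and on a miss the line is not stored at all.
    """
    cache = OrderedDict()  # first key = next victim, last key = most recently used
    misses = 0
    for line, hinted in accesses:
        if line in cache:
            # Hit: a hinted line moves to the victim end, any other to the MRU end.
            cache.move_to_end(line, last=not hinted)
            continue
        misses += 1
        if hinted or num_lines == 0:
            continue  # the line passes through without being stored
        if len(cache) >= num_lines:
            cache.popitem(last=False)  # evict the current replacement candidate
        cache[line] = True
    return misses

# Hypothetical stream: lines a and b are reused, line x is touched only once.
without_hint = [("a", False), ("b", False), ("x", False), ("a", False), ("b", False)]
with_hint = [("a", False), ("b", False), ("x", True), ("a", False), ("b", False)]

print(count_misses(without_hint, 2))  # x evicts a, so every access misses
print(count_misses(with_hint, 2))     # x bypasses the cache, so a and b hit
```

In this sketch the single hinted access to x is enough to keep a and b resident in a 2-line cache, which is the pollution-avoidance effect the hints are meant to provide.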

With cache hints, our compiler can carry out the OCP cache replacement policy by
controlling how long accessed data stays in the cache. As a result, cache pollution
is reduced and the cache miss rate decreases.



3 A Compiler-Assisted Cache Replacement Policy: Cache Partition

Consider the short memory access stream in Figure 2. For a 4-line LRU cache, this
stream incurs 10 cache misses: $M_{LRU} = 10$.

Now suppose the same cache is logically partitioned into three parts of 3 lines,
0 lines, and 1 line, and the three parts are assigned to the three references A, B
and C respectively. In that case, the memory access stream of reference A can hold
at most 3 cache lines, reference C can hold only 1 cache line, and reference B
cannot hold any cache line. Under this cache partition, the cache miss analysis of
the three references is shown in Figure 3. It shows that the cache miss analysis of
the three references can proceed independently, with no mutual influence.

Instruction/reference:  A   A   A   B   C   A   A   A   B   C
Data location:          b1  b2  b3  b5  b8  b1  b2  b3  b6  b8
                        time →

Fig. 2. The top row indicates 10 memory reference instructions. The bottom row
shows the corresponding memory locations.

By choosing a reasonable cache partition and limiting the number of cache lines
held by each reference, cache misses can be reduced noticeably. The limiting can
be carried out by a compiler by appending cache hints to some memory instructions.
For example, to prevent reference B from holding any cache line, a cache hint is
appended to the instruction of reference B; to limit the number of cache lines
held by references A and C, a cache hint is appended to the instructions of
references A and C conditionally.



Cache lines: 4

Cache Partition    3                    0        1
Reference          A                    B        C
Access stream      b1,b2,b3,b1,b2,b3    b5,b6    b8,b8
Cache misses       3                    2        1

Cache miss analysis for Cache Partition (3,0,1): each reference has its own LRU
sub-cache (A: 3 lines, B: 0 lines, C: 1 line).

$M_{CP} = 3 + 2 + 1 = 6$

Fig. 3. A 4-line LRU cache is divided logically into three LRU sub-caches, and
cache misses decrease from 10 to 6.
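
The miss counts of Figures 2 and 3 can be reproduced with a short simulation. The sketch below assumes a fully associative LRU cache and models the partition as one independent LRU sub-cache per reference; the helper name `lru_misses` is an illustrative choice, not part of the paper's tooling.

```python
from collections import OrderedDict

def lru_misses(lines, capacity):
    """Count the misses of an access stream in an LRU cache of `capacity` lines."""
    cache, misses = OrderedDict(), 0
    for line in lines:
        if line in cache:
            cache.move_to_end(line)  # refresh recency on a hit
            continue
        misses += 1
        if capacity == 0:
            continue  # a 0-line sub-cache never stores anything
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used line
        cache[line] = True
    return misses

# The stream of Figure 2: issuing reference and accessed memory location.
refs = "AAABCAAABC"
data = ["b1", "b2", "b3", "b5", "b8", "b1", "b2", "b3", "b6", "b8"]
partition = {"A": 3, "B": 0, "C": 1}

m_lru = lru_misses(data, 4)
m_cp = sum(
    lru_misses([d for r, d in zip(refs, data) if r == ref], size)
    for ref, size in partition.items()
)
print(m_lru, m_cp)
```

Because each reference is simulated in its own sub-cache, the per-reference miss counts (3, 2 and 1) can be computed without looking at the other references, which is the independence property the partition is designed to provide.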



With cache hints, as the above example shows, a compiler can construct a novel
cache replacement policy in which the cache is partitioned logically into parts and
the data of each reference in a loop nest is confined to one part, as if there were
a sub-cache for each reference. Under this cache replacement policy, the cache
behaviors of different references do not influence each other, and the cache
misses of each reference can be analyzed independently. Cache behavior under this
policy is simpler and more independent than under an LRU-only replacement policy.

Definition 1. For an LRU cache with N lines, a Cache Partition $P$ of a loop nest
with m references is an m-tuple $(C_1, C_2, \ldots, C_m)$, where
$\sum_{i=1}^{m} C_i \le N$ and the number of cache lines held by the i-th reference
$Ref_i$ at any moment cannot exceed $C_i$.

A Cache Partition (CP) is a cache replacement policy that achieves two aims for
each reference $Ref_i$: (i) there are $C_i$ lines in the cache that can be held by
$Ref_i$ at any moment; (ii) the number of cache lines held by $Ref_i$ cannot exceed
$C_i$ at any moment. The CP replacement policy, which can be carried out by a
compiler by appending cache hints to some memory instructions, simplifies the cache
behavior of loop nests. However, a crucial question remains: can it reduce cache
misses? To address this question, cache misses under the CP policy are analyzed
and an optimum CP policy is derived in the next section.



4 Cache Misses Analysis and Optimum Cache Partition 

In this section, the Optimum Cache Partition is derived from an analysis of cache
misses under the CP replacement policy. A matrix-multiply loop nest, MXM (see
Figure 4), is used as our primary example, and all arrays discussed here are
assumed to be arranged in column-major order, as in Fortran.

      DO i = 1, U1
        DO k = 1, U2
          DO j = 1, U3
            Z(j,i) = Z(j,i) + X(k,i) * Y(j,k)

Fig. 4. Matrix-multiply loop nest, MXM



4.1 Terminology 

Our research is based on analyzing references’ cache behavior using iteration spaces 
and reuse vectors. 

A reference is a static read or write in the program, while a particular execution of 
that read or write at runtime is a memory access. 

Throughout this paper, we denote the cache size as $C_s$, the line size as $L_s$,
and the number of cache lines as $N_s$, so that $C_s = L_s \times N_s$.

Iteration space. Every iteration of a loop nest is viewed as a single entity termed
an iteration point, and the set of all iteration points is known as the iteration
space. Formally, we represent a loop nest of depth n as a finite convex polyhedron
in the n-dimensional iteration space
$S = \prod_{i=1}^{n} [1, U_i] \subset \mathbb{Z}^n$, bounded by the loop bounds
$U_i$ ($1 \le i \le n$). Each iteration of the loop corresponds to a node in the
polyhedron. Every iteration point is identified by its index vector
$\vec{p} = (i_1, i_2, \ldots, i_n) \in S$, where $i_j$ is the loop index of the
j-th loop in the nest, with the outermost loop corresponding to the leftmost index.
In this representation, if iteration $\vec{p}_2$ executes after iteration
$\vec{p}_1$, we write $\vec{p}_2 \succ \vec{p}_1$ and say that $\vec{p}_2$ is
lexicographically greater than $\vec{p}_1$.

Reuse vector. Reuse vectors provide a mechanism for summarizing repeated memory
access patterns in loop-oriented code [4]. If a reference accesses the same memory
line in iterations $\vec{i}_1$ and $\vec{i}_2$, where $\vec{i}_2 \succ \vec{i}_1$,
we say that there is reuse in direction $\vec{r} = \vec{i}_2 - \vec{i}_1$, and
$\vec{r}$ is called a reuse vector. For example, the reference Z(j, i) in Figure 4
accesses the same memory line at the iteration points (i, k, j) and (i, k+1, j),
and hence one of its reuse vectors is (0, 1, 0). A reuse vector is repeated across
the iteration space. Reuse vectors provide a concise mathematical representation of
the reuse information of a loop nest.
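
The reuse vectors of the MXM references can be checked numerically. The sketch below is an illustration under stated assumptions: arrays are 100 x 100, stored in column-major order, aligned to cache-line boundaries, with 4 elements per cache line (the values used later in Figure 6). The helper names are hypothetical.

```python
U = 100             # array dimension, matching the loop bounds used in Figure 6
ELEMS_PER_LINE = 4  # 16-byte cache line / 4-byte array element

def element_index(row, col):
    """Column-major linear index of the 1-based element (row, col)."""
    return (col - 1) * U + (row - 1)

# Each reference maps an iteration point (i, k, j) to the element it touches.
REFS = {
    "X(k,i)": lambda i, k, j: element_index(k, i),
    "Y(j,k)": lambda i, k, j: element_index(j, k),
    "Z(j,i)": lambda i, k, j: element_index(j, i),
}

def shifted(p, r):
    """Iteration point reached from p by following reuse vector r."""
    return tuple(a + b for a, b in zip(p, r))

def same_element(ref, p, r):
    """Temporal reuse: p and p + r touch the same array element."""
    return REFS[ref](*p) == REFS[ref](*shifted(p, r))

def same_line(ref, p, r):
    """Spatial reuse: p and p + r touch the same cache line (aligned arrays assumed)."""
    line = lambda point: REFS[ref](*point) // ELEMS_PER_LINE
    return line(p) == line(shifted(p, r))

print(same_element("Z(j,i)", (1, 1, 1), (0, 1, 0)))  # temporal reuse along k
print(same_line("X(k,i)", (1, 1, 1), (0, 1, 0)))     # spatial reuse along k
print(same_line("X(k,i)", (1, 4, 1), (0, 1, 0)))     # crosses a line boundary
```

The last call illustrates why spatial reuse is only partial: every fourth step along the spatial reuse vector moves to a new cache line.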

Realized reuse vector. If a reuse vector results in a cache hit, we say that the
reuse vector is realized.

With an infinitely large cache, every reuse vector would result in a cache hit. In
practice, however, a reuse vector does not always result in a cache hit. The
central idea behind our model is to maximize the number of realized reuse vectors.



4.2 Cache Misses Analysis Under Cache Partition Replacement Policy 

Under a Cache Partition $P = (C_1, C_2, \ldots, C_m)$ replacement policy, the
memory accesses of every reference $Ref_i$ ($1 \le i \le m$) in a loop nest can
hold at most $C_i$ cache lines and are not influenced by the accesses of other
references, as if reference $Ref_i$ exclusively occupied a $C_i$-line sub-cache,
$Cache_i$ (see Figure 5). So the cache behavior under the Cache Partition
$P = (C_1, C_2, \ldots, C_m)$ replacement policy is the same as the combined cache
behavior of every reference $Ref_i$ ($1 \le i \le m$) in its own $Cache_i$ with
$C_i$ cache lines.



[Figure 5: an LRU cache partitioned into LRU sub-caches $Cache_1$, $Cache_2$, ...,
$Cache_m$, serving references $Ref_1$, $Ref_2$, ..., $Ref_m$ respectively.]

Cache misses for all Refs = sum of the $Cache_i$ misses for $Ref_i$ ($1 \le i \le m$)

Fig. 5. Cache Partition replacement policy $P = (C_1, C_2, \ldots, C_m)$



Now we use Presburger arithmetic [15] to model exactly the cache behavior of
reference $Ref_i$ ($1 \le i \le m$) in $Cache_i$ with $C_i$ cache lines. Most cache
hits of a reference in a loop nest are produced by one or two of its reuse vectors,
so we primarily analyze the cache behavior of a reference when only one or two of
its reuse vectors are realized; the cases in which more than two reuse vectors are
realized are also considered. The cache behavior analysis has four steps:

1) The cache-hit iterations of the reference are formulated for the case where a
   reuse vector is realized in the whole iteration space. For each reuse vector
   $\vec{r}$ of the reference $Ref_i$, a formula $CHI(\vec{r})$ is generated.
   $CHI(\vec{r})$ represents all cache-hit iterations when reuse vector $\vec{r}$
   is realized in the whole iteration space.

2) The number of cache lines that $Cache_i$ needs in order to realize a reuse
   vector $\vec{r}$ of $Ref_i$ in the whole iteration space is counted.

3) After steps 1 and 2, a hits-cost pair, <Hits, Cost>, is constructed for every
   reuse vector $\vec{r}$ of the reference $Ref_i$. The hits-cost pair
   <Hits, Cost> means that if the number of cache lines assigned to $Ref_i$ is not
   less than the cost, namely $Cost \le C_i$, then $Ref_i$ produces Hits cache
   hits in the whole iteration space through its reuse vector $\vec{r}$. The set
   of hits-cost pairs of $Ref_i$ is denoted $HCSet_i$.

4) From $HCSet_i$ ($1 \le i \le m$), $H_P$, the number of cache hits under the
   Cache Partition $P$ replacement policy, is estimated, and then the Optimum
   Cache Partition, the Cache Partition that results in the maximum number of
   cache hits, is derived.

The four steps are explained in further detail below.



1. Cache-Hit Iterations 

When a reuse vector $\vec{r}$ of reference $Ref_i$ is realized in the whole
iteration space S, the set of cache-hit iterations of reference $Ref_i$ is
formulated as $CHI(\vec{r})$:

$$CHI(\vec{r}) = \{\vec{p} \mid (\vec{p} \in S) \wedge (\exists \vec{p}\,' : (\vec{p}\,' \in S) \wedge (\vec{p} = \vec{p}\,' + \vec{r}))\} \qquad (1)$$

The above formula means that reference $Ref_i$ at iteration $\vec{p}$ can hit
through reuse vector $\vec{r}$ if and only if there is another iteration
$\vec{p}\,'$ whose data is reused at $\vec{p}$ through reuse vector $\vec{r}$,
expressed as $\vec{p} = \vec{p}\,' + \vec{r}$.

The number of cache-hit iterations in $CHI(\vec{r})$ can be calculated as follows
(with $\vec{r} = (r_1, r_2, \ldots, r_n)$):

$$|CHI(\vec{r})| = \prod_{i=1}^{n} (U_i - |r_i|) \qquad (1.1)$$

For a spatial reuse vector $\vec{r}_s = (r_{s1}, r_{s2}, \ldots, r_{sn})$, formulas
(1) and (1.1) are unsuitable. The cache behavior of a loop nest with a spatial
reuse vector is complicated. To get a concise formula for a spatial reuse vector
without a complicated model, we modify formula (1.1) by appending a coefficient:

$$|CHI(\vec{r}_s)| = \left(\prod_{i=1}^{n} (U_i - |r_{si}|)\right) \times \frac{b_{\vec{r}_s} - 1}{b_{\vec{r}_s}} \qquad (1.2)$$

In formula (1.2), $b_{\vec{r}_s}$ stands for the number of iterations in which
reference $Ref_i$ accesses a single cache line along the spatial reuse vector
$\vec{r}_s$. Among the $b_{\vec{r}_s}$ iterations in a single cache line,
$(b_{\vec{r}_s} - 1)$ are reuses, so the number of cache hits of a spatial reuse
vector has the coefficient $(b_{\vec{r}_s} - 1)/b_{\vec{r}_s}$.
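
Formulas (1), (1.1) and (1.2) translate directly into a few lines of code. The sketch below is an illustration only: the function names are hypothetical, the iteration space is assumed to be the box 1..U_i in each dimension, and the spatial count is rounded down to an integer.

```python
from itertools import product
from math import prod

def chi_count(bounds, r):
    """Formula (1.1): number of iterations p in S for which p - r is also in S."""
    return prod(u - abs(ri) for u, ri in zip(bounds, r))

def chi_count_spatial(bounds, r, elems_per_line):
    """Formula (1.2): formula (1.1) scaled by (b - 1) / b, rounded down."""
    return chi_count(bounds, r) * (elems_per_line - 1) // elems_per_line

def chi_brute_force(bounds, r):
    """Formula (1) by enumeration; practical only for small iteration spaces."""
    space = set(product(*(range(1, u + 1) for u in bounds)))
    return sum(1 for p in space if tuple(a - b for a, b in zip(p, r)) in space)

# MXM with U1 = U2 = U3 = 100 and 4 elements per cache line.
mxm = (100, 100, 100)
print(chi_count(mxm, (0, 0, 1)))             # temporal reuse vector of X(k,i)
print(chi_count_spatial(mxm, (0, 1, 0), 4))  # spatial reuse vector of X(k,i)
```

For the MXM bounds these calls give 990000 and 742500, and on a small space such as (3, 4, 5) the brute-force enumeration of formula (1) agrees with the closed form (1.1).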
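When two reuse vectors $\vec{r}_1$ and $\vec{r}_2$ of reference $Ref_i$ are
realized in the whole iteration space S, the set of cache-hit iterations of
reference $Ref_i$ is formulated as $CHI(\vec{r}_1, \vec{r}_2)$:

$$CHI(\vec{r}_1, \vec{r}_2) = \{\vec{p} \mid (\vec{p} \in S) \wedge (\exists \vec{p}\,' : (\vec{p}\,' \in S) \wedge (\vec{p} = \vec{p}\,' + \vec{r}_1 \vee \vec{p} = \vec{p}\,' + \vec{r}_2))\} \qquad (1')$$

When k reuse vectors $\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k$ of reference $Ref_i$
are realized in the whole iteration space S, the set of cache-hit iterations of
reference $Ref_i$ is formulated as $CHI(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k)$:

$$CHI(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k) = \{\vec{p} \mid (\vec{p} \in S) \wedge (\exists \vec{p}\,' : (\vec{p}\,' \in S) \wedge (\vec{p} = \vec{p}\,' + \vec{r}_1 \vee \vec{p} = \vec{p}\,' + \vec{r}_2 \vee \ldots \vee \vec{p} = \vec{p}\,' + \vec{r}_k))\} \qquad (1'')$$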



2. The Number of Cache Lines 

This step counts the number of cache lines $NCL(\vec{r})$ that $Cache_i$ needs in
order to realize a reuse vector $\vec{r}$ of $Ref_i$ in the whole iteration space.
First, the iteration set $ACI(\vec{r}, \vec{q})$ is defined: at any iteration
$\vec{q} \in S$, if the data lines accessed by $Ref_i$ at the iterations in
$ACI(\vec{r}, \vec{q})$ are held by $Cache_i$, then the reuse vector $\vec{r}$ is
realized in the whole iteration space. The number of cache lines $NCL(\vec{r})$ in
$Cache_i$ is then derived.

$$\forall \vec{q} \in S: \; ACI(\vec{r}, \vec{q}) = \{\vec{p} \mid (\vec{p} \in S) \wedge (\vec{p} + \vec{r} \in S) \wedge (\vec{p} \prec \vec{q}) \wedge (\vec{p} + \vec{r} \succ \vec{q})\} \qquad (2)$$

$ACI(\vec{r}, \vec{q})$ is similar to a reference window [14]: when it is held by
the cache, the reuse vector $\vec{r}$ is realized.

$$NCL(\vec{r}) = \max\{\, |ACI(\vec{r}, \vec{q})| \mid \vec{q} \in S \,\} \qquad (3)$$

When the number of cache lines in $Cache_i$ is not less than $NCL(\vec{r})$,
$Cache_i$ can hold all data lines accessed at the iterations in
$ACI(\vec{r}, \vec{q})$, and the reuse vector $\vec{r}$ is then realized in the
whole iteration space.

To avoid extensive calculation of $ACI(\vec{r}, \vec{q})$, we give a simple
estimation formula (3.1). We observe that $|ACI(\vec{r}, \vec{q})|$ is not greater
than $|\{\vec{p} \mid (\vec{p} \in S) \wedge (\vec{p} \prec \vec{q}) \wedge (\vec{p} + \vec{r} \succ \vec{q})\}|$,
which is the number of iteration points spanned by the reuse vector $\vec{r}$ in
the loop nest iteration space, so $NCL(\vec{r})$ can be estimated as:

$$NCL(\vec{r}) = \sum_{i=1}^{n} \left( r_i \prod_{j=i+1}^{n} U_j \right), \quad \text{where } \prod_{j=n+1}^{n} U_j = 1 \qquad (3.1)$$
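
The estimate can be compared against a direct enumeration of formulas (2) and (3) on a small iteration space. The sketch below rests on two assumptions that should be checked against the original typesetting: formula (3.1) is read as the sum over i of r_i times the product of the inner loop bounds, and both lexicographic comparisons in formula (2) are taken as strict. Under that reading the estimate reproduces the costs listed for MXM in Figure 6. Function names are hypothetical.

```python
from itertools import product
from math import prod

def ncl_estimate(bounds, r):
    """Assumed reading of (3.1): sum over i of r_i * U_(i+1) * ... * U_n."""
    # prod() of an empty slice is 1, matching the convention stated with (3.1).
    return sum(r[i] * prod(bounds[i + 1:]) for i in range(len(bounds)))

def ncl_brute_force(bounds, r):
    """Formulas (2) and (3) by enumeration; only practical for small spaces."""
    space = list(product(*(range(1, u + 1) for u in bounds)))
    members = set(space)

    def shifted(p):
        return tuple(a + b for a, b in zip(p, r))

    def window_size(q):
        # Python compares tuples lexicographically, which matches the
        # iteration order with the outermost loop index first.
        return sum(1 for p in space if p < q < shifted(p) and shifted(p) in members)

    return max(window_size(q) for q in space)

mxm = (100, 100, 100)
for r in [(0, 0, 1), (0, 1, 0), (1, 0, 0)]:
    print(r, ncl_estimate(mxm, r))

small = (3, 4, 5)
for r in [(0, 0, 1), (0, 1, 0), (1, 0, 0)]:
    print(r, ncl_brute_force(small, r), ncl_estimate(small, r))
```

On the small space the enumerated window never exceeds the estimate, which is consistent with (3.1) being presented as an upper-bound style estimate rather than an exact count.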

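When two reuse vectors $\vec{r}_1$ and $\vec{r}_2$ of reference $Ref_i$ are
realized in the whole iteration space S, $ACI(\vec{r}_1, \vec{r}_2, \vec{q})$ and
$NCL(\vec{r}_1, \vec{r}_2)$ are formulated as follows (assuming
$\vec{r}_1 \prec \vec{r}_2$):

$$\forall \vec{q} \in S: \; ACI(\vec{r}_1, \vec{r}_2, \vec{q}) = ACI(\vec{r}_1, \vec{q}) \cup \{\vec{p} \mid (\vec{p} \in S) \wedge (\vec{p} + \vec{r}_2 \in S) \wedge (\vec{p} \prec \vec{q}) \wedge (\vec{p} + \vec{r}_2 \succ \vec{q}) \wedge (\vec{p} + \vec{r}_2 - \vec{r}_1 \notin S)\} \qquad (2')$$

$$NCL(\vec{r}_1, \vec{r}_2) = \max\{\, |ACI(\vec{r}_1, \vec{r}_2, \vec{q})| \mid \vec{q} \in S \,\} \qquad (3')$$

When k reuse vectors $\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k$ of reference $Ref_i$
are realized in the whole iteration space S,
$ACI(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k, \vec{q})$ and
$NCL(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k)$ are formulated as follows (assuming
$\vec{r}_1 \prec \vec{r}_2 \prec \ldots \prec \vec{r}_k$):

$$\forall \vec{q} \in S: \; ACI(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k, \vec{q}) = ACI(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_{k-1}, \vec{q}) \cup \{\vec{p} \mid (\vec{p} \in S) \wedge (\vec{p} + \vec{r}_k \in S) \wedge (\vec{p} \prec \vec{q}) \wedge (\vec{p} + \vec{r}_k \succ \vec{q}) \wedge (\vec{p} + \vec{r}_k - \vec{r}_1 \notin S) \wedge \ldots \wedge (\vec{p} + \vec{r}_k - \vec{r}_{k-1} \notin S)\} \qquad (2'')$$

$$NCL(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k) = \max\{\, |ACI(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k, \vec{q})| \mid \vec{q} \in S \,\} \qquad (3'')$$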

3. Hits-Cost Pairs Set 

The hits-cost pair <Hits, Cost> means that if the number of cache lines $C_i$
assigned to $Ref_i$ is not less than Cost, namely $Cost \le C_i$, then $Ref_i$
produces Hits cache hits in the whole iteration space through its reuse vectors.
The set of hits-cost pairs of $Ref_i$ is denoted $HCSet_i$.

$$HCSet_i = \{ \langle 0, 0 \rangle, \langle |CHI(\vec{r})|, NCL(\vec{r}) \rangle, \langle |CHI(\vec{r}_1, \vec{r}_2)|, NCL(\vec{r}_1, \vec{r}_2) \rangle, \ldots, \langle |CHI(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k)|, NCL(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k) \rangle \mid \vec{r}, \vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k \text{ are reuse vectors of } Ref_i \} \qquad (4)$$

The hits-cost pair <0, 0> means that when no cache line is assigned to $Ref_i$, no
cache hit is produced. $HCSet_i$ describes the relationship between the number of
cache lines assigned to $Ref_i$ and its number of cache hits. When calculating
$HCSet_i$ in our experiments, the hits-cost pairs
$\langle |CHI(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k)|, NCL(\vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k) \rangle$
with k > 2 were ignored, for two reasons: i) there are not enough cache lines to
realize more reuse vectors; ii) most cache hits of a reference are produced by one
or two of its reuse vectors.

4. Optimum Cache Partition 

When $C_i$ cache lines are assigned to reference $Ref_i$ under the Cache Partition
$P$ replacement policy, the number of cache hits of $Ref_i$ can be estimated as
follows:

$$Hit_i(C_i) = \max\{\, Hits \mid (\langle Hits, Cost \rangle \in HCSet_i) \wedge (Cost \le C_i) \,\} \qquad (5)$$

Under the Cache Partition $P = (C_1, C_2, \ldots, C_m)$ replacement policy, the
cache hits of a loop nest can be estimated as follows:

$$H_P = \sum_{i=1}^{m} Hit_i(C_i) \qquad (6)$$

Formula (6) estimates the number of cache hits of a loop nest under a Cache
Partition. Different Cache Partitions can yield different numbers of cache hits. A Cache




Cache Behavior Analysis of a Compiler-Assisted Cache Replacement Policy 






Partition that yields the most cache hits is called the Optimum Cache Partition
replacement policy.

Definition 2. Optimum Cache Partition of a loop nest. Let $P$ be a Cache Partition
of a loop nest. If the inequality $H_{P'} \le H_P$ holds for every other Cache
Partition $P'$, then $P$ is called an Optimum Cache Partition (OCP) of the loop
nest.

Figure 6 shows all hits-cost pairs for the MXM program when one temporal reuse
vector or one spatial reuse vector is realized. The reference X(k, i) has a
temporal reuse vector $\vec{r}_t = (0,0,1)$ and a spatial reuse vector
$\vec{r}_s = (0,1,0)$. According to formulas (1) to (6), when the temporal reuse
vector (0,0,1) is realized, there is a hits-cost pair <990000, 1>, which means that
if only 1 cache line is assigned to reference X(k, i), X(k, i) produces 990000
cache hits through the temporal reuse vector (0,0,1); when the spatial reuse vector
(0,1,0) is realized, there is a hits-cost pair <742500, 100>, which means that if
100 cache lines are assigned to reference X(k, i), X(k, i) produces 742500 cache
hits through the spatial reuse vector (0,1,0).

If there are 132 cache lines in the cache, $N_s = 132$, then the Cache Partition
$P = (1, 1, 100)$ is an Optimum Cache Partition of the MXM loop nest, and under the
OCP cache replacement policy the number of cache hits is estimated as follows:

$H_P = 990000 + 742500 + 990000 = 2722500$.
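
The choice of the Optimum Cache Partition for MXM can be reproduced by exhaustive search over the hits-cost pairs of Figure 6. The sketch below is a brute-force illustration, not the paper's algorithm: it picks one hits-cost pair per reference, keeps only combinations whose total cost fits in the cache, and maximizes the total hits as in formulas (5) and (6). The names `HC_SETS` and `optimum_partition` are hypothetical.

```python
from itertools import product

# Hits-cost pairs from Figure 6, plus the <0, 0> pair each set contains.
HC_SETS = {
    "X(k,i)": [(0, 0), (990000, 1), (742500, 100)],
    "Y(j,k)": [(0, 0), (990000, 10000), (742500, 1)],
    "Z(j,i)": [(0, 0), (990000, 100), (742500, 1)],
}

def optimum_partition(hc_sets, num_lines):
    """Return (total hits, per-reference line counts) maximizing estimated hits."""
    best_hits, best_costs = -1, None
    for choice in product(*hc_sets.values()):
        costs = tuple(cost for _, cost in choice)
        hits = sum(h for h, _ in choice)
        # The partition is feasible only if the sub-caches fit in the cache.
        if sum(costs) <= num_lines and hits > best_hits:
            best_hits, best_costs = hits, costs
    return best_hits, best_costs

print(optimum_partition(HC_SETS, 132))
```

With 132 cache lines the search returns 2722500 hits for the line counts (1, 1, 100), in the order X, Y, Z. Exhaustive search is adequate here because each reference has only a handful of hits-cost pairs; a larger problem would call for a knapsack-style method.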



Ref       Temporal reuse vector   <Hits, Cost>       Spatial reuse vector   <Hits, Cost>
X(k, i)   (0,0,1)                 <990000, 1>        (0,1,0)                <742500, 100>
Y(j, k)   (1,0,0)                 <990000, 10000>    (0,0,1)                <742500, 1>
Z(j, i)   (0,1,0)                 <990000, 100>      (0,0,1)                <742500, 1>

Fig. 6. Hits-cost pairs for an MXM program where $U_1 = U_2 = U_3 = 100$, the cache
line size $L_s$ is 16 bytes, and the size of an array element is 4 bytes, so
$b_{\vec{r}_s} = 4$. If there are 132 cache lines in the cache, then the Cache
Partition $P = (1, 1, 100)$ is an Optimum Cache Partition.

The Optimum Cache Partition (OCP) of a loop nest is a compiler-assisted cache
replacement policy that simplifies loop nest cache behavior, facilitates cache hit
estimation, and decreases cache misses efficiently.

The OCP policy is carried out by the compiler in three primary steps: 1) obtaining
the reuse vectors of the loop nest, which can be calculated automatically using
existing work on loop optimization; 2) calculating the Optimum Cache Partition
automatically; 3) appending cache hints conditionally to realize the OCP policy.

Appending cache hints to a reference $Ref_i$ to limit the number of cache lines it
holds is difficult if the limit $C_i$ is chosen arbitrarily. But under the Optimum
Cache Partition replacement policy, the limit $C_i$ for a reference is calculated
from its reuse vector, and the condition for appending cache hints is
straightforward.

Under the OCP cache replacement policy, if a reference $Ref_i$ is limited to $C_i$
cache lines so that its reuse vector $\vec{r}$ is realized, then at any loop
iteration point $\vec{q}$, the condition for appending a cache hint to $Ref_i$ is
that $Ref_i$ cannot produce a new reuse along the reuse vector $\vec{r}$ from
$\vec{q}$. Formally, when the point $(\vec{q} + \vec{r})$ is outside the loop
iteration space, a cache hint is appended to the reference $Ref_i$ at iteration
point $\vec{q}$.
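
The hint condition amounts to a bounds check on the point reached by following the reuse vector. A minimal sketch, assuming the iteration space is the box 1..U_i in every dimension; the function names are hypothetical:

```python
def in_space(point, bounds):
    """True when every loop index lies within 1..U for its loop."""
    return all(1 <= x <= u for x, u in zip(point, bounds))

def needs_hint(q, r, bounds):
    """True when q + r leaves the iteration space, so no further reuse follows along r."""
    target = tuple(a + b for a, b in zip(q, r))
    return not in_space(target, bounds)

# MXM reference X(k,i) with temporal reuse vector (0,0,1) and loop bounds of 100.
bounds = (100, 100, 100)
print(needs_hint((1, 1, 99), (0, 0, 1), bounds))   # the next j iteration reuses X(k,i)
print(needs_hint((1, 1, 100), (0, 0, 1), bounds))  # last j iteration: hint is appended
```

For this reference the condition reduces to testing whether j equals its upper bound, which is the kind of simple predicate a compiler can attach to a load instruction inside the loop.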



5 Experimental Results 

5.1 Experimental Platform 

We have implemented the OCP cache replacement policy in the Trimaran compiler and
evaluated its performance by running some SPEC benchmarks. Trimaran is a compiler
infrastructure supporting state-of-the-art research in compiling for EPIC
architectures. We have re-engineered the back end of Trimaran. Implementing the
OCP cache replacement policy required nothing more than appending cache hints to
some load/store instructions conditionally.

In the Trimaran compiler infrastructure, cache behavior is simulated with
DineroIV, a trace-driven cache simulator. We have extended DineroIV to support
cache hints in memory instructions.

For a set-associative cache, conflict misses may occur when too many data items
map to the same set of cache locations. To eliminate or reduce conflict misses, we
have implemented two data-layout transformations, inter- and intra-array
padding [9].



5.2 Performance Results 

We chose 8 benchmarks from SPECint2000 and applied the OCP cache replacement
policy to their loop kernels. We evaluated our approach on 4 KB caches with a
64-byte line size and varying associativity (4-way and fully associative). Cache
miss rates under the LRU and OCP cache replacement policies are compared in
Table 1.

For a fully associative cache, the OCP policy is quite effective for all chosen
benchmarks, with an average reduction in cache misses of 24.04%. For a 4-way
set-associative cache, the OCP policy reduces the number of cache misses by 16.59%
on average and is also quite effective, except for the vpr and vortex benchmarks.
The percentage reduction achieved on a 4-way cache is lower than that achieved on a
fully associative cache. This could be due to the conflict misses produced by a
set-associative cache. The vpr and vortex benchmarks probably produce more conflict
misses under the OCP policy than under the LRU replacement policy, so their cache
miss rates under OCP are greater than under LRU in Table 1.
Table 1. Effectiveness of the OCP cache replacement policy in reducing cache
misses. Row "LRU" reports cache miss rates under the LRU replacement policy. Row
"OCP" reports cache miss rates under the OCP replacement policy. Row "Red." gives
the percentage reduction in cache miss rates due to our approach. The average
reduction in cache misses is 20.32%.

Benchmarks' cache miss rates (%)

Cache         gzip     vpr      gcc      mcf      parser   vortex   bzip2    twolf    average
4-way  LRU    2.56     5.72     1.96              2.24     1.71     1.18
       OCP    2.13     5.75     1.61              1.87     1.77     0.98
       Red.   16.80%   -0.52%   17.86%   41.88%   16.52%   -3.51%   16.95%   26.78%   16.59%
Full   LRU    2.53     4.57     1.86     10.9     1.94     1.36     1.04     3.35
       OCP             3.91              6.32     1.58     1.18     0.83     2.25
       Red.   28.46%   14.44%            42.02%   18.56%   13.24%   20.19%   32.84%   24.04%
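
The "Red." rows can be cross-checked with a few lines of arithmetic. The sketch below assumes that the reduction is the relative decrease (LRU - OCP) / LRU, an assumption that reproduces the printed percentages; it uses only the 4-way cells for which both miss rates appear in the table.

```python
def reduction(lru, ocp):
    """Percentage reduction in miss rate when moving from LRU to OCP."""
    return (lru - ocp) / lru * 100

# (LRU, OCP) miss rates in percent for the 4-way columns printed in full.
four_way = {
    "gzip": (2.56, 2.13), "vpr": (5.72, 5.75), "gcc": (1.96, 1.61),
    "parser": (2.24, 1.87), "vortex": (1.71, 1.77), "bzip2": (1.18, 0.98),
}
for name, (lru, ocp) in four_way.items():
    print(f"{name}: {reduction(lru, ocp):.2f}%")

# Averaging the eight 4-way "Red." entries gives the reported 4-way average.
four_way_red = [16.80, -0.52, 17.86, 41.88, 16.52, -3.51, 16.95, 26.78]
mean_red = sum(four_way_red) / len(four_way_red)
print(f"4-way average: {mean_red:.3f}%")
```

The negative values for vpr and vortex correspond to the two benchmarks whose miss rate rises slightly under OCP on the 4-way cache, and the mean of the 4-way and fully associative averages (16.59% and 24.04%) is consistent with the overall figure of about 20.3%.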



6 Related Work 

Improving cache performance through cache hints has attracted a lot of attention
from both the architecture and the compiler perspective. In [10], Jain et al.
propose keep and kill instructions. The keep instruction locks data into the
cache, while the kill instruction marks it as the first candidate to be replaced.
Jain et al. also prove under which conditions the keep and kill instructions
improve the cache hit rate. In [11], it is proposed to extend each cache line with
an EM (Evict Me) bit. The bit is set by software, based on a locality analysis. If
the bit is set, that cache line is the first candidate to be evicted from the
cache. These approaches all suggest interesting modifications to the cache
hardware that allow the compiler to improve the cache replacement policy.

Kristof Beyls et al. [13] proposed a framework to generate both source and target
cache hints from the reuse distance metric. Since the reuse distance indicates
cache behavior irrespective of the cache size and associativity, it can be used to
make caching decisions for all levels of cache simultaneously. Two methods were
proposed to determine the reuse distances in the program: one based on profiling,
which statically assigns a cache hint to a memory instruction, and one based on
analytical calculation, which allows the most appropriate hint to be selected
dynamically. The advantage of the profiling-based method is that it works for all
programs. The analytical calculation of reuse distances is applicable to
loop-oriented code and has the advantage that the reuse distance is calculated
independently of program input and for every single memory access. But their work,
which generates cache hints from the reuse distance, has the cyclic dependency
problem [12] mentioned in Section 1: cache hint assignment requires accurate reuse
distance estimation, while accurate reuse distance estimation is only possible
once cache hint assignment is finalized. They ignored the impact of cache hints on
the reuse distance.
Hongbo Yang et al. [12] studied the relationship between cache miss rate and the
cache residency of reference windows, observing that for an array reference to
realize its temporal reuse, its reference window must be fully accommodated in the
cache. They then formulated the problem as a 0/1 knapsack problem. The
relationship between cache miss rate and cache residency of reference windows is
similar to the one considered in Section 4.2; however, our work differs from theirs
in two major aspects: (i) we carried out cache hit analysis under the OCP cache
replacement policy, which they did not do in their 0/1 knapsack formulation; (ii)
we considered the impact of both temporal and spatial reuse vectors on cache miss
rate, while they only considered the impact of reference windows, which are
determined by temporal reuse vectors.



7 Conclusions 

EPIC architectures provide cache hints to give the compiler more control over data
cache behavior. In this paper we constructed a compiler-assisted cache replacement
policy, Optimum Cache Partition, which utilizes the cache more efficiently to
achieve better performance. In particular, we presented a novel cache replacement
policy, Cache Partition, which can be carried out through cache hints. Under the
Cache Partition policy, we studied the relationship between the cache hit rate and
the reuse vectors of a reference, and constructed the hits-cost pairs of the
reference. A hits-cost pair describes how many cache hits a reference can produce
when a given number of cache lines is assigned to it. After formulating the number
of cache hits of a loop nest under a Cache Partition, we derived the Optimum Cache
Partition replacement policy. To the best of our knowledge, the OCP cache
replacement policy is the simplest effective cache optimization with cache hints;
it results in simple cache behavior and makes cache miss analysis and optimization
easy and efficient.

We evaluated our OCP cache replacement policy by implementing it in the Trimaran
compiler and simulating cache behavior in DineroIV. Our simulation results show
that the OCP policy exploits the architecture's potential well, reducing the
number of data cache misses by 20.32% on average.
Abstract. Analytical modeling is one of the most interesting approaches to
evaluating memory hierarchy behavior. Unfortunately, models have many limitations
regarding the structure of the code they can be applied to, particularly when the
path of execution depends on conditions calculated at run-time that depend on the
input or intermediate data. In this paper we extend in this direction a modular
analytical modeling technique that provides very accurate estimations of the
number of misses produced by codes with regular access patterns and structures
while having a low computing cost. Namely, we have extended this model to analyze
codes with data-dependent conditionals. In previous work we studied how to analyze
codes with a single, simple conditional statement. In this work we introduce and
validate a general and completely systematic strategy that enables the analysis of
codes with any number of conditionals, possibly nested in any arbitrary way, while
allowing the conditionals to depend on any number of items and atomic conditions.



1 Introduction 

Memory hierarchies try to cushion the increasing gap between the processor and
the memory speed. Fast and accurate methods to evaluate the performance of
the memory hierarchies are needed in order to guide the compiler in choosing the
best transformations and parameters for them when trying to make the optimal
usage of this hierarchy. Trace-driven simulation [1] was the preferred approach to
study the memory behavior for many years. This technique is very accurate, but
its high computational cost makes it unsuitable for many applications. For this
reason, analytical modeling, which requires much shorter computing times than
previous approaches and provides more information about the reasons for the
predicted behavior, has gained importance in recent years [2,3,4]. Still, it has
important drawbacks like the lack of modularity in some models, and the limited
set of codes that they can model.

* This work has been supported in part by the Ministry of Science and Technology
of Spain under contract TIC2001-3694-C02-02, and by the Xunta de Galicia under
contract PGIDIT03-TIC10502PR.

In this work we present an extension to an existing analytical model that
allows the analysis of codes with any kind of conditional sentences. The model was
already improved in [5] to enable it to analyze codes with a reference inside a
simple and single conditional sentence. We now extend it to analyze codes with
any kind and number of conditional sentences, even with references controlled
by several conditionals nested in any arbitrary way. As in previous
works, we require the verification of the conditions in the IF statements to follow
a uniform distribution.

This model is built around the idea of the Probabilistic Miss Equations 
(PMEs) [2]. These equations estimate analytically the number of misses gen- 
erated by a given code in set-associative caches with LRU replacement policy. 
The PME model can be applied both to perfectly nested loops and imperfectly 
nested loops, with one loop per nesting level. It allows several references per data 
structure and loops controlled by other loops. Loop nests with several loops per 
level can also be analyzed by this model, although certain conditions need to be 
fulfilled in order to obtain accurate estimations. This work is part of an ongoing 
research line whose aim is to build a compiler framework [6,7], which extracts 
information from the analytical modeling, in order to optimize the execution of 
complete scientific codes. 

The rest of the paper is organized as follows. The next section presents some 
important concepts to understand the PME model and our extension. Section 3 
describes the process of formulation after adapting the previous existing model 
to the new structures it has to model. In Sect. 4 we describe the process of 
validation of our model, using codes with several conditional sentences. A brief 
review of the related work is presented in Sect. 5, followed by our conclusions 
and a discussion on the future work in Sect. 6. 



2 Introduction to the PME Model 

The Probabilistic Miss Equations (PME) model, described in [2], generates cache
behavior predictions accurately and efficiently for codes with regular access
patterns. The model classifies misses as either compulsory or interference misses.
The former take place the first time that the lines are accessed, while the latter
are associated with new accesses for which the corresponding cache line has
been evicted since its previous access. The PME model builds an equation for
each reference and nesting level that encloses the reference. This equation esti- 
mates the number of misses generated by the reference in that loop taking into 
account both kinds of misses. Its probabilistic nature comes from the fact that 
interference misses are estimated through the computation of a miss interference 
probability for every attempt of reuse of a line. 
2.1 Area Vectors

The miss probability when attempting to reuse a line depends on the cache 
footprint of the regions accessed since the immediately preceding reference to 
the considered line. The PME model represents these footprints by means of 
what it calls area vectors. Given a data structure V and a k-way set-associative
cache, $S_V = (S_{V0}, S_{V1}, \ldots, S_{Vk})$ is the area vector associated with the access to
V during a given period of the program execution. The i-th element, $i > 0$, of
this vector represents the ratio of sets that have received $k - i$ lines from the
structure, while $S_{V0}$ is the ratio of sets that have received k or more lines.
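
As an illustration of this definition, the following Python sketch builds an area vector from the number of lines that each cache set has received. The function name `area_vector` and the input list `lines_per_set` are assumptions made for this example only; the PME model derives area vectors analytically from the access pattern parameters, not from per-set counts.

```python
def area_vector(lines_per_set, k):
    """Return [S_0, ..., S_k] for a k-way set-associative cache.

    S_0 is the ratio of sets holding k or more lines of the structure;
    S_i (i > 0) is the ratio of sets holding exactly k - i lines.
    """
    num_sets = len(lines_per_set)
    vector = [0.0] * (k + 1)
    for lines in lines_per_set:
        # Sets that are completely filled by the structure go to position 0.
        position = 0 if lines >= k else k - lines
        vector[position] += 1.0 / num_sets
    return vector


# Hypothetical example: four sets of a 2-way cache that received 3, 2, 1 and 0 lines.
print(area_vector([3, 2, 1, 0], k=2))  # [0.5, 0.25, 0.25]
```

The first element of the vector is the one that matters for miss estimation: it is the ratio of sets in which a line of another structure would have been evicted.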

The PME model analyzes the access pattern of the references for each dif- 
ferent data structure found in a program and derives the corresponding area 
vectors from the parameters that define those access patterns. The two most 
common access patterns found in the kind of codes we intend to model are the 
sequential access and the access to groups of elements separated by a constant 
stride. See [2] for more information about how the model estimates the area 
vectors from the access pattern parameters. 

Due to the existence of references that take place with a given probability
in codes with data-dependent conditionals, a new kind of access pattern arises
in them that we had not previously analyzed. This pattern can be described
as an access to groups of consecutive elements separated by a constant stride,
in which every access happens with a given fixed probability. The calculation
of the area vector associated with this new access pattern is not included in this
paper because of size limitations. This pattern will be denoted as $R_{rl}(M, N, P, p)$,
which represents the access to M groups of N elements separated by a distance
P where every access happens with a given probability p (see example in Sect. 4).

Very often, several data structures are referenced between two accesses to the 
same line of a data structure. As a result, a mechanism is needed to calculate 
the global area vector that represents as a whole the impact on the cache of 
the accesses to several structures. This is achieved by adding the area vectors 
associated with the different data structures. The mechanism to add two area
vectors has also been described in [2], so although it is used in the following 
sections, we do not explain it here. The addition algorithm treats the different 
ratios in the area vectors as independent probabilities, thus disregarding the 
relative positions of the data structures in memory. This is in fact an advantage 
of the model, as in most situations these addresses are unknown at compile time 
(dynamically allocated data structures, physically indexed caches, etc.). This 
way, the PME model is still able to generate reasonable predictions in these 
cases, as opposed to most of those in the bibliography [3,4], which require the 
base addresses of the data structures in order to generate their estimations. Still, 
when such positions are known, the PME model can estimate the overlapping 
coefficients of the footprints associated with the accesses to each one of the 
structures involved, so they can be used to improve the accuracy of the addition. 
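
The addition algorithm itself is the one given in [2] and is not reproduced here. As a rough sketch of what "treating the ratios as independent probabilities" can mean, the Python function below combines two area vectors by multiplying the ratios of every pair of positions and capping the resulting number of lines per set at k. The function name `add_area_vectors` and this particular formulation are assumptions made for illustration, not the authors' implementation.

```python
def add_area_vectors(su, sv):
    """Combine two area vectors of a k-way cache assuming independence.

    Position 0 holds the ratio of sets with k or more lines; position i > 0
    holds the ratio of sets with exactly k - i lines.
    """
    k = len(su) - 1
    result = [0.0] * (k + 1)
    for i, ratio_u in enumerate(su):
        for j, ratio_v in enumerate(sv):
            # Lines that both structures place in the same set.
            lines = (k - i) + (k - j)
            position = 0 if lines >= k else k - lines
            # Independence: the joint ratio is the product of the two ratios.
            result[position] += ratio_u * ratio_v
    return result


# Two structures that each place one line in every set of a 2-way cache
# fill all the sets when their footprints are added.
print(add_area_vectors([0.0, 1.0, 0.0], [0.0, 1.0, 0.0]))  # [1.0, 0.0, 0.0]
```

Because only ratios are combined, no base address of either structure is needed, which is the property highlighted in the paragraph above.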
DO I_0 = 1, N_0, L_0
  DO I_1 = 1, N_1, L_1
    IF cond(D(f_D1(I_D1), ..., f_DdD(I_DdD)))
      DO I_Z = 1, N_Z, L_Z
        A(f_A1(I_A1), ..., f_AdA(I_AdA))
        IF cond(B(f_B1(I_B1), ..., f_BdB(I_BdB)))
          C(f_C1(I_C1), ..., f_CdC(I_CdC))
        END IF
      END DO
    END IF
  END DO
END DO

Fig. 1. Nested loops with several data-dependent conditions



2.2 Scope of Application 

The original PME model in [2] did not support the modeling of codes with 
any kind of conditionals. Figure 1 shows the kind of codes that it can analyze 
after applying our extension. The figure shows several nested loops that have a 
constant number of iterations known at compile time. Several references, which 
need not be in the innermost nesting level, are found in the code. Some references 
are affected by one or more nested conditional sentences that depend on the 
data arrays. All the structures are indexed using affine functions of the loop
indexes, $f_{Ai}(I_{Ai}) = \alpha_{Ai} I_{Ai} + \delta_{Ai}$. We also assume that the verification of the
conditions in the IF statements follows a uniform distribution, although the
different conditions may hold with different probabilities. Such probabilities are
inputs to our model that are obtained either by means of profiling tools or from
knowledge of the behavior of the application. We also assume that the conditions are
independent.

As for the hardware, the PME model is oriented to set-associative caches 
with LRU replacement policy. In what follows, we will refer to the total size of
this cache as $C_s$, to the line size as $L_s$, and k will be the degree of associativity
or number of lines per set.

3 Miss Equations 

The PME model estimates the number of misses generated by a code using the 
concept of miss equation. Given a reference, the analysis of its behavior begins 
in the innermost loop containing it, and proceeds outwards. In this analysis, a
probabilistic miss equation is generated for each reference and in each nesting 
level that encloses it following a series of rules. 

We will refer to the miss equation for reference R in nesting level i as
$F_i(R, \mathrm{RegInput}, \vec{p})$. Its expression depends on RegInput, the region accessed since the
last access to a given line of the data structure. Since we now consider the existence
of conditional sentences, the original PME parameters have been extended
with a new one, $\vec{p}$. This vector contains in position j the probability $p_j$ that the
(possible) conditionals that guard the execution of the reference R in nesting
level j are verified. If no conditionals are found in level j, then $p_j = 1$. When
there are several nested IF statements in the same nesting level, $p_j$ corresponds
to the product of the probabilities that their respective conditions hold.
This is a first improvement with respect to our previous approach [5],
which used a scalar because only a single conditional was considered.

Depending on the situation, two different kinds of formulas can be applied: 

— If the variable associated to the current loop i does not index any of the 
references found in the condition(s) of the conditional(s) sentence(s), then 
we apply a formula from the group of formulas called Condition Independent 
Reference Formulas (CIRF). This is the kind of PME described in [2]. 

— If the loop variable indexes any of such references, then we apply a formula 
from the group called Condition Dependent Reference Formulas (CDRF). 

Another factor influencing the construction of a PME is the existence of other 
references to the same data structure, as they may carry some kind of group reuse.
For simplicity, in what follows we will restrict our explanation to references that 
carry no reuse with other references. 



3.1 Condition Independent Reference Formulas 

When the index variable for the current loop i is not among those used in the 
indexing of the variables referenced in the conditional statements that enclose 
the reference R, the PME for this reference and nesting level is given by 

$$F_i(R, \mathrm{RegInput}, \vec{p}) = L_{Ri} F_{i+1}(R, \mathrm{RegInput}, \vec{p}) + (N_i - L_{Ri}) F_{i+1}(R, \mathrm{Reg}(A, i, 1), \vec{p}) , \tag{1}$$

where $N_i$ is the number of iterations in the loop of nesting level i, and $L_{Ri}$ is the
number of iterations in which there is no possible reuse for the lines referenced
by R. $\mathrm{Reg}(A, i, j)$ stands for the memory region accessed during j iterations of
the loop in nesting level i that can interfere with data structure A.

The formula calculates the number of misses for a given reference R in nesting
level i as the sum of two values. The first one is the number of misses produced
by the $L_{Ri}$ iterations in which there can be no reuse in this loop. The miss
probability for these iterations depends on the accesses and reference pattern in
the outer loops. The second value corresponds to the iterations in which there
can be reuse of cache lines accessed in the previous iteration, and so it depends
on the memory regions accessed during one iteration of the loop.

The indexes of the reference R are affine functions of the variables of the loops
that enclose it. As a result, R has a constant stride $S_{Ri}$ along the iterations of
loop i. This value is calculated as $S_{Ri} = \alpha_{Aj} d_{Aj}$, where j is the dimension whose
index depends on $I_i$, the variable of the loop; $\alpha_{Aj}$ is the scalar that multiplies
the loop variable in the affine function, and $d_{Aj}$ is the size of the j-th dimension.
If $I_i$ does not index reference R, then $S_{Ri} = 0$. This way, $L_{Ri}$ can be calculated
as

$$L_{Ri} = 1 + \left\lfloor \frac{N_i - 1}{\max\{L_s / S_{Ri}, 1\}} \right\rfloor . \tag{2}$$

The formula calculates the number of accesses of R that cannot exploit either
spatial or temporal locality, which is equivalent to estimating the number of
different lines that are accessed during $N_i$ iterations with stride $S_{Ri}$.
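
A minimal Python sketch of Eqs. (1) and (2) follows. The function names and the convention that a zero stride yields a single non-reusable iteration (the limit of Eq. (2) when the loop variable does not index R) are assumptions of this example; the values of $F_{i+1}$ are passed in as plain numbers because their computation depends on the area vectors of the regions involved.

```python
import math


def no_reuse_iterations(n_i, line_size, stride):
    """Eq. (2): iterations of loop i in which R cannot reuse a line.

    n_i, line_size and stride are expressed in array elements.
    """
    if stride == 0:
        # The loop variable does not index R: the same line is touched
        # in every iteration, so only the first access cannot reuse it.
        return 1
    return 1 + math.floor((n_i - 1) / max(line_size / stride, 1))


def cirf(n_i, l_ri, misses_first, misses_reuse):
    """Eq. (1): L_Ri * F_{i+1}(RegInput) + (N_i - L_Ri) * F_{i+1}(Reg(A, i, 1))."""
    return l_ri * misses_first + (n_i - l_ri) * misses_reuse


# Hypothetical loop of 100 iterations, 8 elements per line, unit stride.
l_ri = no_reuse_iterations(100, 8, 1)
print(l_ri)                        # 13 different lines are touched
print(cirf(100, l_ri, 1.0, 0.0))   # 13.0 misses if only first touches miss
```

With a stride equal to or larger than the line size, every iteration touches a new line and `no_reuse_iterations` returns the number of iterations of the loop.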



3.2 Condition Dependent Reference Formulas 

If the index variable for the current loop i is used in the indexes of the arrays used
in the conditions that control the reference R, the behavior of R with respect
to this loop is irregular. The reason is that different values of the index access
different pieces of data to test in the conditions. This way, in some iterations
the conditions hold and R is executed, thus affecting the cache, while in other
iterations the associated conditions do not hold and no access of R takes place.
As a result, the reuse distance for the accesses of R is no longer fixed: it depends
on the probabilities $\vec{p}$ that the conditions that control the execution of R are
verified. If the probabilities that the different conditions hold are known, the number
of misses associated with the different reuse distances can be weighted using the
probability that each reuse distance takes place.

As we have just seen, Eq. (2) estimates the number $L_{Ri}$ of iterations of the
loop in level i in which reference R cannot exploit reuse. Since the loop has
$N_i$ iterations, this means that on average each different line can be reused in up to
$G_{Ri} = N_i / L_{Ri}$ consecutive iterations. Besides, either reference R itself or the
loop in level i + 1 that contains it can be inside a conditional in level i that
holds with probability $p_i$. Thus, $p_i L_{Ri}$ different groups of lines will be accessed
on average, and each one of them can be reused up to $G_{Ri}$ times. Taking this
into account, the general form of a condition-dependent PME is

$$F_i(R, \mathrm{RegInput}, \vec{p}) = p_i L_{Ri} \sum_{j=1}^{G_{Ri}} \mathrm{WMR}_i(R, \mathrm{RegInput}, j, \vec{p}) , \tag{3}$$

where $\mathrm{WMR}_i(R, \mathrm{RegInput}, j, \vec{p})$ is the weighted number of misses generated by
reference R in level i considering the j-th attempt of reuse of the $G_{Ri}$ ones
potentially possible. As in Sect. 3.1, RegInput is the region accessed since the
last access to a given line of the considered data structure when the execution
of the loop begins. Notice that if no condition encloses R or the loop around it
in this level, simply $p_i = 1$.

D. Andrade, B.B. Fraguela, and R. Doallo

The number of misses associated with reuse distance j, weighted with the probability
that an access with such a reuse distance does take place, is calculated as

$$\begin{aligned} \mathrm{WMR}_i(R, \mathrm{RegInput}, j, \vec{p}) ={}& (1 - P_i(R, \vec{p}))^{j-1} F_{i+1}(R, \mathrm{RegInput} \cup \mathrm{Reg}(A, i, j-1), \vec{p}) \\ &+ \sum_{k=1}^{j-1} P_i(R, \vec{p}) (1 - P_i(R, \vec{p}))^{k-1} F_{i+1}(R, \mathrm{Reg}(A, i, k-1), \vec{p}) , \end{aligned} \tag{4}$$

where $P_i(R, \vec{p})$ yields the probability that R accesses each of the lines it can
potentially reference during one iteration of the loop in nesting level i. This
probability is a function of those conditionals in $\vec{p}$ in or below the nesting level
analyzed. The first term in (4) considers the case that the line has not been
accessed during any of the previous $j - 1$ iterations. In this case, the RegInput
region that could generate interference with the new access to the line when the
execution of the loop begins must be added to the regions accessed during these
$j - 1$ previous iterations of the loop in order to estimate the complete interference
region. The second term weights the probability that the last access took place
in each of the $k = 1, \ldots, j - 1$ previous iterations of the considered loop.
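
The structure of Eqs. (3) and (4) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' tool: the names `wmr` and `cdrf` are hypothetical, $G_{Ri}$ is taken as an integer, and the two kinds of $F_{i+1}$ values are supplied by caller-provided functions (`miss_no_prior` and `miss_prior`) so that the sketch does not depend on how the interference regions are built. The weight $P_i (1 - P_i)^{k-1}$ used for the k-th term of the sum is our reading of Eq. (4); with it, the weights of each reuse attempt add up to one.

```python
def wmr(prob, j, miss_no_prior, miss_prior):
    """Weighted misses for the j-th reuse attempt, following Eq. (4).

    prob          -- P_i(R, p), probability that a line is accessed in one iteration
    miss_no_prior -- j -> F_{i+1} when none of the j - 1 previous iterations
                     accessed the line
    miss_prior    -- k -> F_{i+1} for the k-th term of the sum in Eq. (4)
    """
    total = (1 - prob) ** (j - 1) * miss_no_prior(j)
    for k in range(1, j):
        total += prob * (1 - prob) ** (k - 1) * miss_prior(k)
    return total


def cdrf(p_i, l_ri, g_ri, prob, miss_no_prior, miss_prior):
    """Eq. (3): p_i * L_Ri * sum of WMR over the G_Ri reuse attempts."""
    return p_i * l_ri * sum(
        wmr(prob, j, miss_no_prior, miss_prior) for j in range(1, g_ri + 1)
    )


# With a single reuse attempt (G_Ri = 1) the equation collapses to
# p_i * L_Ri * F_{i+1}(RegInput); the numbers below are hypothetical.
print(cdrf(0.5, 10, 1, 0.5, lambda j: 0.25, lambda k: 0.0))  # 1.25
```

Passing constant functions that return 1.0 is a quick sanity check: `wmr` then returns 1.0 for any `j`, since the weights form a probability distribution over the possible positions of the last access.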



Line Access Probability. The probability $P_i(R, \vec{p})$ that the reference R whose
behavior is being analyzed does access one of the lines that belong to the region
that it can potentially access during one iteration of loop i is a basic parameter
to derive $\mathrm{WMR}_i(R, \mathrm{RegInput}, j, \vec{p})$, as we have just seen. This probability depends
not only on the access pattern of the reference in this nesting level, but also on
those in the inner ones, so its calculation takes into account all the loops from the i-th
down to the one containing the reference. In fact, this probability is calculated
recursively in the following way:

$$P_i(R, \vec{p}) = \begin{cases} p_i & \text{if } i \text{ is the innermost loop that contains } R \\ p_i \, P_{i+1}(R, \vec{p}) & \text{if the index of loop } i+1 \text{ is not used in the references in conditions that control } R \\ p_i \left(1 - (1 - P_{i+1}(R, \vec{p}))^{G_{R(i+1)}}\right) & \text{otherwise} \end{cases} \tag{5}$$

where we must remember that $p_i$ is the product of all the probabilities associated
with the conditional sentences affecting R that are located in nesting level i.

This algorithm to estimate the probability of access per line at level i has 
been improved with respect to our previous work [5] , as it is now able to integrate 
different conditions found in different nesting levels, while the previous one only 
considered a single condition. 
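
The recursion of Eq. (5) can be sketched in Python as shown below. The function name and the list-based encoding of the loop nest (`p`, `g`, `cond_uses_index`, with level 0 as the outermost loop and the last level as the innermost loop containing R) are assumptions made for this example. The third branch follows our reading of the last case of Eq. (5), which reproduces the values $p_1 p_2$ and $1 - (1 - p_1 p_2)^N$ used for C(I,J) in Sect. 4.1.

```python
def line_access_probability(level, p, g, cond_uses_index):
    """Eq. (5) for the loop at `level` (0 is the outermost loop).

    p[i]               -- product of the condition probabilities at level i
    g[i]               -- G_Ri, iterations that can reuse a line at level i
    cond_uses_index[i] -- True if the index of loop i appears in the
                          conditions that control R
    """
    innermost = len(p) - 1
    if level == innermost:
        return p[level]
    inner = line_access_probability(level + 1, p, g, cond_uses_index)
    if not cond_uses_index[level + 1]:
        return p[level] * inner
    # The line is accessed if at least one of the G_{R(i+1)} iterations
    # of the inner loop that can touch it actually executes R.
    return p[level] * (1 - (1 - inner) ** g[level + 1])


# Reference C(I,J) of Fig. 3 with hypothetical values p1 = 0.2, p2 = 0.1,
# N = 750 and L_s = 8: p = (1, p1, p2) and G = (L_s, N, 1).
p1, p2, n, ls = 0.2, 0.1, 750, 8
p = [1.0, p1, p2]
g = [ls, n, 1]
uses = [True, True, True]
print(line_access_probability(1, p, g, uses))  # approximately p1 * p2
print(line_access_probability(0, p, g, uses))  # approximately 1 - (1 - p1 * p2) ** N
```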
posB=1
DO I=1,N
  offB(I)=posB
  DO J=1,M
    IF A(I,J).NEQ.0
      B(posB)=A(I,J)
      jB(posB)=J
      posB=posB+1
    END IF
  ENDDO
ENDDO

Fig. 2. CRS Storage Algorithm



DO I=1,M
  DO K=1,N
    IF A(I,K).NEQ.0
      DO J=1,P
        IF B(K,J).NEQ.0
          C(I,J)=C(I,J)+A(I,K)*B(K,J)
        END IF
      ENDDO
    END IF
  ENDDO
ENDDO

Fig. 3. Optimized product of matrices



3.3 Calculation of the Number of Misses 

In the innermost level that contains the reference R, both in CIRFs and CDRFs,
$F_{i+1}(R, \mathrm{RegInput}, \vec{p})$, the number of misses caused by the reference in the
immediately inner level, is $\mathrm{AV}_0(\mathrm{RegInput})$, that is, the first element in the area vector
associated with the region RegInput.

The number of misses generated by reference R in the analyzed nest is finally
estimated as $F_0(R, \mathrm{RegInput}_{total}, \vec{p})$ once the PME for the outermost loop is
generated. In this expression, $\mathrm{RegInput}_{total}$ is the total region, that is, the region
that covers the whole cache. The miss probability associated with this region is
one.

4 Model Validation 

We have validated our model by applying it manually to the two quite simple but 
representative codes shown in Fig. 2 and Fig. 3. The first code implements the 
storage of a matrix in CRS format (Compressed Row Storage), which is widely 
used in the storage of sparse matrices. It has two nested loops and a conditional 
sentence that affects three of the references. The second code is an optimized
product of matrices that consists of a nest of loops that contain references inside
several nested conditional sentences.

Results for both codes will be shown in Sect. 4.2, but we will first focus on the 
second code in order to provide a detailed idea about the modeling procedure. 



4.1 Optimized Product Modeling 

This code is shown in Fig. 3. It implements the product of two matrices, A
and B, with a uniform distribution of nonzero entries. As a first optimization,
when the element of A to be used in the current product is 0, none of its products
with the corresponding elements of B is performed. As a second optimization,
if the element of B to be used in the current product is 0, then that operation
is not performed. This avoids two floating point operations and the load and
store of C(I,J).
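
To make the effect of the two conditionals concrete, the following Python sketch mirrors the loop nest of Fig. 3 and counts how often each array is referenced. The function name `optimized_product` and the `refs` counter are assumptions of this example; matrices are plain lists of rows, so the memory layout differs from the Fortran code and only the control flow is reproduced.

```python
def optimized_product(a, b):
    """Multiply a (M x N) by b (N x P), skipping zero entries as in Fig. 3.

    Returns the product and the number of times each array is referenced.
    """
    m, n, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(m)]
    refs = {"A": 0, "B": 0, "C": 0}
    for i in range(m):
        for k in range(n):
            refs["A"] += 1              # A(I,K) is read in every iteration
            if a[i][k] != 0:
                for j in range(p):
                    refs["B"] += 1      # B(K,J) is read only if A(I,K) is nonzero
                    if b[k][j] != 0:
                        refs["C"] += 1  # C(I,J) is touched only if both are nonzero
                        c[i][j] += a[i][k] * b[k][j]
    return c, refs


product, refs = optimized_product([[1, 0], [0, 2]], [[3, 0], [0, 4]])
print(product)  # [[3, 0], [0, 8]]
print(refs)     # {'A': 4, 'B': 4, 'C': 2}
```

If the two conditions hold with probabilities $p_1$ and $p_2$, the counters grow on average as $MN$, $p_1 MNP$ and $p_1 p_2 MNP$, which is why B(K,J) and C(I,J) need the condition-dependent formulas while A(I,K) does not.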

Without loss of generality, we assume a compiler that maps scalar variables 
to registers and which tries to reuse the memory values recently read in processor 
registers. Under these conditions, the code in Fig. 3 contains three references to 
memory. The model in [2] can estimate the behavior of the reference A(I,K),
which takes place in every iteration of its enclosing loops.

Thus, we will focus our explanation on the modeling of the behavior of the
references C(I,J) and B(K,J), which are not covered in previous publications.



C(I,J) Modeling. The analysis begins in the innermost loop, in level 2. In this
level the loop variable indexes one of the references in one of the conditions, so
the CDRF formula must be applied.

Here $S_{R2} = P$, $L_{R2} = 1 + N$, $G_{R2} = 1$, and $p_2$ is the component in vector
$\vec{p}$ associated with the probability that the condition inside the loop in nesting
level 2 holds. This loop is in the innermost level, so $F_3(R, \mathrm{RegInput}, \vec{p}) =
\mathrm{AV}_0(\mathrm{RegInput})$. After simplification, the formulation is

$$F_2(R, \mathrm{RegInput}, \vec{p}) = p_2 P \, \mathrm{AV}_0(\mathrm{RegInput}) . \tag{6}$$

In the next upper level, level 1, the loop variable indexes one reference of one
of the conditions, so the CDRF formula has to be applied. Let $S_{R1} = 0$, $L_{R1} = 1$
and $G_{R1} = N$; then

$$F_1(R, \mathrm{RegInput}, \vec{p}) = p_1 \sum_{j=1}^{N} \mathrm{WMR}_1(R, \mathrm{RegInput}, j, \vec{p}) . \tag{7}$$

In order to compute $\mathrm{WMR}_1$ we need to calculate the value of two functions.
One is $P_1(R, \vec{p})$, which for our reference takes the value $p_1 p_2$, where $p_i$ is the i-th
element in vector $\vec{p}$. The other one is $\mathrm{Reg}(C, 1, i)$, the region accessed during i
iterations of loop 1 that can interfere with the accesses to C:

$$\mathrm{Reg}(C, 1, i) = R_{rl_{auto}}(P, 1, M, 1 - (1 - p_1 p_2)^i) \cup R_r(i, 1, M) \cup R_{rl}(P, i, N, p_1) . \tag{8}$$

The first term is associated with the autointerference of C, which is the access
to P groups of one element separated by a distance M, where every access takes
place with a given probability. The second term represents the access to i groups
of one element separated by a distance M. The last term represents the access
to P groups of i elements separated by a distance N, where every access
happens with a given probability $p_1$.

In the outermost level, the loop variable indexes the reference of the condition.
As a result, the CDRF formula is to be applied again. Since $S_{R0} = 1$,
$L_{R0} = 1 + \lfloor (N - 1)/L_s \rfloor$ and $G_{R0} = L_s$, the formulation is

$$F_0(R, \mathrm{RegInput}, \vec{p}) = \left(1 + \lfloor (N - 1)/L_s \rfloor\right) \sum_{j=1}^{L_s} \mathrm{WMR}_0(R, \mathrm{RegInput}, j, \vec{p}) . \tag{9}$$

As before, two functions must be evaluated to compute $\mathrm{WMR}_0$. They are
$P_0(R, \vec{p}) = 1 - (1 - p_1 p_2)^N$ and $\mathrm{Reg}(C, 0, i)$, given by

$$\mathrm{Reg}(C, 0, i) = R_{rl_{auto}}(P, 1, M, 1 - (1 - p_1 p_2)^N) \cup R_r(N, i, M) \cup R_l(PN, 1 - (1 - p_1)^{L_s}) . \tag{10}$$

The first term is associated with the autointerference of C, which is the access
to P groups of one element separated by a distance M, where every access takes
place with a given probability. The second term represents the access to N
groups of i elements separated by a distance M. The last term represents the
access to PN consecutive elements with a given probability.



B(K,J) Modeling. The innermost loop for this reference is also the one in level
2. The variable that controls this loop, J, is found in the indexes of a reference
found in the condition of an IF statement (in this case, the innermost one),
so a CDRF is to be built. As this is the innermost loop, we get
$F_3(R, \mathrm{RegInput}, \vec{p}) = \mathrm{AV}_0(\mathrm{RegInput})$. Since $S_{R2} = N$, $L_{R2} = P$ and $G_{R2} = 1$, the
formulation for this nesting level is

$$F_2(R, \mathrm{RegInput}, \vec{p}) = P \, \mathrm{AV}_0(\mathrm{RegInput}) . \tag{11}$$



The next level is level 1. In this level the loop variable indexes a reference
in one of the conditions, so we have to use the CDRF formula.
Since $S_{R1} = 1$, $L_{R1} = 1 + \lfloor (N - 1)/L_s \rfloor$ and $G_{R1} = L_s$, the formulation is

$$F_1(R, \mathrm{RegInput}, \vec{p}) = p_1 \left(1 + \left\lfloor \frac{N - 1}{L_s} \right\rfloor\right) \sum_{j=1}^{L_s} \mathrm{WMR}_1(R, \mathrm{RegInput}, j, \vec{p}) . \tag{12}$$

We need to know $P_1(R, \vec{p}) = p_1$ and the value of the accessed regions
$\mathrm{Reg}(B, 1, i)$ in order to compute $\mathrm{WMR}_1$:

$$\mathrm{Reg}(B, 1, i) = R_{rl_{auto}}(P, i, N, p_1) \cup R_r(i, 1, M) \cup R_{rl}(P, 1, M, p_1 p_2) . \tag{13}$$
Table 1. Validation data for the code in Fig. 2 for several cache configurations and
different problem sizes and condition probabilities

    M      N    p  C_s  L_s   K   Δ_MR  T_sim  T_exe  T_mod
 6200  10150  0.4  32K    8   4  0.001     82     19  0.001
 4200  17150  0.1   4K    4   2  0.401    107     18  0.001
16220   7200  0.2  16K    4   2  2.635    152     24  0.044
 6200  14250  0.3  32K    8   4  0.005    146     22  0.001
 9200  14250  0.1   4K    4   8  2.374    582     50  0.001
 1100  15550  0.5   4K    4   8  0.027      2      1  0.001
 2900  17250  0.3  32K   16   4  1.847     65     32  0.001
 8900   9250  0.1  64K    8   4  3.055    118     46  0.010
 4200  12150  0.1   4K    4   2  0.571     64     33  0.001
 5000  15000  0.3  32K    8   4  0.183    125     54  0.001
 7200  12250  0.1   4K    4   8  0.044    139     64  0.010

The first term is associated with the autointerference of B, which is the access to
P groups of i elements separated by a distance N, where every access takes place
with a given probability. The second term represents the access to i groups of
one element separated by a distance M. The last term represents the access
to P groups of one element separated by a distance M, where every access takes place
with a given probability $p_1 p_2$.

In the outermost level, level 0, the variable of the loop indexes a reference
in one of the conditions, so we have to apply the CDRF formula again. Since
$S_{R0} = 0$, $L_{R0} = 1$ and $G_{R0} = M$, the formulation is

$$F_0(R, \mathrm{RegInput}, \vec{p}) = \sum_{j=1}^{M} \mathrm{WMR}_0(R, \mathrm{RegInput}, j, \vec{p}) . \tag{14}$$

In this loop, $\mathrm{WMR}_0$ is a function of $P_0(R, \vec{p}) = 1 - (1 - p_1)^{L_s}$ and the value
of the accessed regions $\mathrm{Reg}(B, 0, i)$:

$$\mathrm{Reg}(B, 0, i) = R_{l_{auto}}(PN, 1 - (1 - p_1)^{L_s}) \cup R_r(N, i, M) \cup R_{rl}(P, i, M, 1 - (1 - p_1 p_2)^N) . \tag{15}$$

The first term is associated with the autointerference of B, which is the access to
PN elements with a given probability. The second term represents the access to
N groups of i elements separated by a distance M. The last term represents
the access to P groups of i elements separated by a distance M, where every access
takes place with a given probability.

4.2 Validation Results 

We have done the validation by comparing the results of the predictions given 
by the model with the results of a trace-driven simulation. We have tried sev- 
eral cache configurations, problem sizes and probabilities for the conditional 
sentences. 
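
For readers unfamiliar with trace-driven simulation, the following Python sketch counts the misses of an address trace in a set-associative cache with LRU replacement. It is a minimal illustration and not the simulator used for the validation; the function name `count_misses` and the convention that addresses and sizes are expressed in array elements are assumptions of this example.

```python
from collections import OrderedDict


def count_misses(trace, cache_size, line_size, assoc):
    """Count misses of an element-address trace in a set-associative LRU cache.

    cache_size and line_size are given in array elements.
    """
    num_sets = cache_size // (line_size * assoc)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in trace:
        line = address // line_size
        lru = sets[line % num_sets]
        if line in lru:
            lru.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) == assoc:
                lru.popitem(last=False)  # evict the least recently used line
            lru[line] = True
    return misses


# Two sequential sweeps over 16 elements with 4 elements per line.
sweep = list(range(16)) * 2
print(count_misses(sweep, cache_size=16, line_size=4, assoc=2))  # 4
print(count_misses(sweep, cache_size=8, line_size=4, assoc=2))   # 8
```

In the first call the whole array fits in the cache, so only the four first touches miss; in the second the cache holds two lines and LRU evicts each line before it is reused.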
Table 2. Validation data for the code in Fig. 3 for several cache configurations and
different problem sizes and condition probabilities

   M    N     P  p_1  p_2  C_s  L_s   K   Δ_MR  T_sim  T_exe  T_mod
 750  750  1000  0.2  0.1  16K    8   8  0.808     35     17  0.075
 750  750  1000  0.8  0.3   8K   16  16  5.081    164     62  0.053
 900  850   900  0.8  0.1  16K   32   2  0.224     78     54  0.795
 900  850   900  0.9  0.1  64K    8   8  0.589    136     66  0.523
 900  950  1500  0.8  0.3  16K    4   2  2.411    236    159  0.110
 900  950  1500  0.1  0.4  32K    8   4  5.408     51     38  0.357
1000  850   900  0.7  0.5   4K    8   2  4.394     98     97  0.054
 200  250   150  0.8  0.2  16K    4   2  0.604      1      0  0.690
 200  250   150  0.1  0.3  32K    8   4  2.161      0      0  0.145
 200  250   150  0.3  0.1   4K    4   8  1.208      0      0  0.008
 100  350    90  0.8  0.5   4K    4   8  0.070      0      1  0.042
 100  350    90  0.4  0.4   8K    8   4  0.417      0      0  0.324
 100  350    90  0.2  0.3   4K    8   2  0.744      0      0  0.218

Tables 1 and 2 display the validation results for the codes in Fig. 2 and 3,
respectively. In Table 1 the first two columns contain the problem size and the
third column stands for the probability p that the condition in the code is fulfilled.
In Table 2 the first three columns contain the problem size, while the next
two columns contain the probabilities $p_1$ and $p_2$ that each of the two conditions
in Fig. 3 is fulfilled. Then the cache configuration is given in both tables by
$C_s$, the cache size, $L_s$, the line size, and the degree of associativity of the cache,
K. The sizes are measured in the number of elements of the arrays used in the
codes. The accuracy of the model is measured by the metric $\Delta_{MR}$, which is based on
the miss rate (MR); it stands for the absolute value of the difference between
the predicted and the measured miss rate.

For every combination of cache configuration, problem size and probabilities 
of the conditions, 25 different simulations have been made using different base 
addresses for the data structures. 
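
The metric can be written down in a few lines of Python. The text does not state how the 25 simulations of each configuration are combined, so comparing the prediction against their mean is an assumption of this sketch, as are the function name `delta_mr` and the sample values.

```python
def delta_mr(predicted_miss_rate, measured_miss_rates):
    """Absolute difference between the predicted miss rate and the mean of
    the measured ones (averaging over the runs is an assumption)."""
    mean_measured = sum(measured_miss_rates) / len(measured_miss_rates)
    return abs(predicted_miss_rate - mean_measured)


# Hypothetical miss rates, in percent, for three simulations with
# different base addresses of the data structures.
print(delta_mr(10.0, [9.0, 11.0, 10.0]))  # 0.0
```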

The results show that the model provides a good estimation of the cache
behavior in the two example codes. The last three columns in both tables reflect
the corresponding simulation times, source code execution times and modeling
times, expressed in seconds and measured on a 2.08 GHz AMD K7 processor-based
system. We can see that the modeling times are much shorter than the
trace-driven simulation and even the execution times, in many cases by several
orders of magnitude. The modeling time does not include the time required to build
the formulas for the example codes. This will be done automatically by the tool
we are currently developing. According to our experience in [2], the overhead of
such a tool is negligible.
5 Related Work

Over the years, several analytical models have been proposed to study the behavior
of caches. Probably the best-known model of this kind is [8], based
on the Cache Miss Equations (CMEs), which are linear systems of Diophantine
equations. Its main drawbacks are its high computational cost and that it is restricted
to analyzing regular access patterns that take place in isolated perfectly
nested loops. In the past few years, some models that can overcome some of these
limitations have arisen. This is the case of the accurate model based on Presburger
formulas introduced in [3], which can analyze codes with non-perfectly
nested loops and consider reuses between loops in different nesting levels. Still,
it can only model small levels of associativity and it has an extremely high computational
cost. More recently, the model in [4], which is based on [8], can also analyze these
kinds of codes in competitive times thanks to the statistical techniques it applies
in the resolution of the CMEs.

A more recent work [9], can model codes with conditional statements. Still, 
it does not consider conditions on the input or intermediate data computed by 
the programs. It is restricted to conditional sentences whose conditions refer to 
the variables that index the loops. 

All these models and others in the bibliography have fundamental differences
with ours. One of the most important is that all of them require knowledge
of the base addresses of the data structures. In practice this is not possible or
useful in many situations for a variety of reasons: data structures
allocated at run-time, physically-indexed caches, etc. Also, thanks to the general
strategy described in this paper, the PME model becomes the first one to be 
able to model codes with data-dependent conditionals. 

6 Conclusions and Future Work 

In this work we have presented an extension to the PME model described in [2].
The extension makes this model the first one that can analyze codes with
data-dependent conditionals, considering not only simple conditional
statements but also nested conditionals affecting a given reference. We are currently
limited by the fact that the conditions must follow a uniform distribution, but
we think our research is an important step in the direction of broadening the
scope of applicability of analytical models. Our validation shows that this model
provides accurate estimations of the number of misses generated by a given code
while requiring quite short computing times. In fact, the model is typically two
orders of magnitude faster than the native execution of the code.

The properties of this model turn it into an ideal tool to guide the opti- 
mization process in a production compiler. In fact, the original PME model has 
been used to guide the optimization process in a compiler framework [7]. We 
are now working on an automatic implementation of the extension of the model
described in this paper in order to integrate it in that framework. As for the 
scope of the program structures that we wish to be amenable to analysis using
the PME model, our next step will be to consider codes with irregular accesses
due to the use of indirections or pointers.

Abstract. This paper describes a novel Configurable System-on-Chip
(CSoC) architecture for stream-based computations and real-time signal 
processing. It offers high computational performance and a high degree of 
flexibility and adaptability by employing a micro Task Controller (mTC) 
unit in conjunction with programmable and configurable hardware. The 
hierarchically organized architecture provides a programming model, allows
an efficient mapping of applications and is shown to be easily scalable
to future VLSI technologies where over a hundred processing cells
on a single chip will be feasible to deal with the inherent dynamics of
future application domains and system requirements. Several mappings
of commonly used digital signal processing algorithms and implementa- 
tion results are given for a standard-cell ASIC design realization in 0.18 
micron 6-layer UMC CMOS technology. 



1 Introduction 

We are currently experiencing an explosive growth in the development and
deployment of embedded devices such as multimedia set-top boxes and personal mobile
computing systems, which demand increasing support for multiple standards
[1,2]. This flexibility requirement points to the need for various communication,
audio and video algorithms which differ in complexity. They mostly have a
heterogeneous nature and comprise several sub-tasks with real-time performance
requirements for data-parallel tasks [3]. Hardware that can cope with these
demands needs different processing architectures: some are parallel, some are
rather pipelined. In general, a combination is needed. Moreover, various
algorithms need different levels of control over the functional units and different
memory access. For instance, multimedia applications (such as different video
decompression schemes) may include a data-parallel task, a bit-level task,
irregular computations, high-precision word operations and a real-time component
[4]. These requirements become even more relevant when Quality-of-Service
(QoS) requirements, e.g. varying the communication bandwidth in wireless
terminals, variable audio quality or a change from full color to black/white
picture quality, become a more important factor. One way to meet the flexibility
and adaptability demands has been to use General Purpose Processors (GPP)
or Digital Signal Processors (DSP), i.e. trying to solve all kinds of applications
running on a very high speed processor. A major drawback of using these
general-purpose devices is that they are extremely inefficient in terms of utilizing their
resources to best take advantage of data-level parallelism in the algorithms.
Today's demands motivate the use of hybrid architectures which integrate
programmable logic together with different embedded resources, and of Configurable
Systems-on-Chip which can be realized by integrating reconfigurable (re-usable)
and programmable hardware components. In contrast to processors, they totally
lack programming models that would allow for device independent compilation
and forward compatibility to other architecture families.

This paper describes a Configurable System-on-Chip approach with
configurable and programmable properties. The architecture combines
a wide variety of macro-module resources including a MIPS-like scalar
processor core, coarse-grained reconfigurable processing arrays, embedded memories
and custom modules supervised by a micro Task Controller. In the architecture,
functions can be dynamically assigned to physical hardware resources such that
the most efficient computation can be obtained. It can be forward compatible to
other CSoC families with variable numbers of reconfigurable processing cells for
different performance features. A key issue for the CSoC system integration
is the coupling of the macro-module resources for efficient mapping
and transfer of data. The programming model is another important aspect of
this paper.

This work is organized as follows. Section 2 presents related work. Section 3
describes the reconfigurable processing array, which is based on previous research
activities. Section 4 introduces the CSoC architecture composition and the system
control mechanism in detail. Section 5 presents the programming
paradigm. Algorithm mapping and performance analysis are presented in Section
6. Finally, Section 7 discusses the design and physical implementation, while
conclusions and future work are drawn in Section 8.



2 Related Work 

There have been several research efforts as well as commercial products that have
tried to explore the use of reconfigurable and System-on-Chip architectures.
They integrate existing components (IP cores) into a single chip or explore
completely new architectures. In the Pleiades project at UC Berkeley [5], the goal is to
create a low-power high-performance DSP system. Yet, the Pleiades architecture
template differs from the proposed CSoC architecture. In the Pleiades
architecture a general purpose microprocessor is surrounded by a heterogeneous array of
autonomous special-purpose satellite processors communicating over a
reconfigurable communication network. In contrast to the Pleiades system, the proposed
architecture offers reconfigurable hardware and data-paths supervised by a
flexible controller unit with a simple instruction set which allows conditional
reconfiguration and efficient hardware virtualization. The CS2112 Reconfigurable
Communication Processor [6] from Chameleon Systems
couples a processor with a reconfigurable fabric composed of 32-bit processor
tiles. The fabric holds a background plane with configuration data which can
be loaded while the active plane is in use. A small state-machine controls every
tile. The embedded processor manages the reconfiguration and streaming data.
The Chameleon chip has a fixed architecture targeting communication
applications. The Configurable SoC architecture offers a micro Task Controller which
allows variable handling of different processing resources and provides forward
compatibility with other CSoC architecture families for different performance
demands. The scalar processor core is not involved in the configuration process.
Furthermore, the configuration mechanism differs completely from the
configuration approach used by Chameleon Systems. MorphoSys from the University of
California, Irvine [7] has a MIPS-like "TinyRISC" processor with an extended
instruction set and a mesh-connected 8 by 8 reconfigurable array of 28-bit ALUs. The
"TinyRISC" controls system execution by initiating context memory and frame
buffer loads using extra instructions via a DMA controller. MorphoSys offers
dynamic reconfiguration with several local configuration memories. The suggested
architecture model includes a micro Task Controller with a simple instruction 
set and a single local configuration memory in each cluster. It uses a pipelined 
configuration concept to configure multiple reconfigurable processing cells. The 
micro task program and the descriptor set can be reused in other CSoC families 
without a recompilation task. 



3 Background 

The Configurable System-on-Chip architecture approach builds on previous
research activities in identifying reconfigurable hardware structures and
providing a new hardware virtualization concept for coarse-grained reconfigurable
architectures. A reconfigurable processing cell array (RPCA) has been designed
which targets applications with inherent data-parallelism, high regularity and
high throughput requirements [8]. The architecture is based on a synchronous
multifunctional pipeline flow model using reconfigurable processing cells and
configurable data-paths. A configuration manager allows run-time and partial
reconfiguration. The RPCA consists of an array of configurable coarse-grained
processing cells linked to each other via broadcast and pipelined data buses.
It is fragmented into four parallel stripes which can be configured in parallel.
The configuration technique is based on a pipelined configuration process via
descriptors. Descriptors represent configuration templates analogous to
instruction operation codes in conventional Instruction Set Architectures (ISA). They
can be sliced into fixed-size computation threads that, in analogy to virtual
memory pages, are swapped onto available physical hardware within a few clock
cycles. The architecture approach results in a flexible reconfigurable hardware
component with performance and function flexibility. Figure 1 shows a cluster
with 16 processing cells in total.

Fig. 1. The structure of a cluster with the configuration manager, the processing cells
with the switch boxes for routing, the dual-port scratch pad memories and the descrip- 
tor memory. In order to adjust configuration cycles, three pipeline registers for every 
stripe are implemented. 



Some important characteristics of the reconfigurable architecture are: 

— Coarse-grained reconfigurable processing model: Each reconfigurable pro- 
cessing cell contains a multifunctional ALU that operates on 48-bit or 2 * 24- 
bit data words (split ALU mode) on signed integer and signed fixed-point 
arithmetic. 

— Computation concept via compute threads: An application is divided into several
computation threads which are mapped and executed on the processing
array. 

— Hardware virtualization: Hardware on demand with a single context mem- 
ory and unbounded configuration contexts (descriptors) with low overhead 
context switching for high computational density. 

— Configuration data minimization: Reuse of a descriptor, to configure several 
processing cells, minimizes the configuration memory. 

— Library-based design approach: Library contains application-specific kernel 
modules composed of several descriptors to reduce the developing-time and 
cost. 

— Scalable architecture for future VLSI technologies: Configuration concept
easily expandable to larger RPCAs.

The RPCA is proposed to be the reconfigurable module of the Configurable 
System-on-Chip design. It executes the signal processing algorithms in the ap- 
plication domain to enhance the performance of critical loops and computation 
intensive functions. 
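The split ALU mode listed among the characteristics above can be sketched as follows. This is a minimal illustration, not the cell's actual data path: the lane packing (upper and lower 24 bits of one 48-bit word), the wrap-around overflow behavior and all function names are assumptions.

```python
MASK24 = (1 << 24) - 1


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pack(hi: int, lo: int) -> int:
    """Pack two signed 24-bit values into one 48-bit word (assumed layout)."""
    return ((hi & MASK24) << 24) | (lo & MASK24)


def unpack(word: int) -> tuple[int, int]:
    """Return the (upper, lower) signed 24-bit lanes of a 48-bit word."""
    return to_signed(word >> 24, 24), to_signed(word, 24)


def add48(a: int, b: int) -> int:
    """One 48-bit signed addition with wrap-around."""
    return to_signed(a + b, 48)


def add_split(a: int, b: int) -> int:
    """Two independent 24-bit additions; no carry crosses the lane boundary."""
    lo = ((a & MASK24) + (b & MASK24)) & MASK24
    hi = (((a >> 24) & MASK24) + ((b >> 24) & MASK24)) & MASK24
    return (hi << 24) | lo
```

In split mode the carry out of the lower lane is discarded, whereas the 48-bit mode propagates it into the upper half of the word.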
4 Architecture Composition

The CSoC architecture model is hierarchically organized. It is composed of a 
micro Task Controller (mTC) unit which includes a set of heterogeneous pro- 
cessing resources and the reconfigurable processing cell array. Figure 2 gives 
a structure overview. To prevent a global bus from becoming a bottleneck, a
high-speed crossbar switch connects the heterogeneous processing resources and
the RPCA. It allows multiple data transfers in parallel. Many algorithms in the
application domain contain abundant parallelism and are compute intensive.
They are preferably mapped spatially onto the RPCA via a set of descriptors
while the other parts can be computed with the heterogeneous processing
resources [8]. An advantage of such an architecture partitioning is the more efficient
mapping of algorithms, as the Fast Fourier Transformation example in section 6
illustrates.






Fig. 2. The hierarchically organized CSoC structure, in which the heterogeneous processing
resources and the reconfigurable processing cell arrays are connected via a crossbar
switch. The RPCA is partitioned into several clusters.



4.1 Configurable System-on-Chip Overview 

The CSoC architecture consists of the mTC and two clusters of reconfigurable 
processing cells. Every cluster comprises sixteen 48-bit coarse-grained reconfig- 
urable processing cells. It offers variable parallelism and pipelining for differ- 
ent performance requirements. A two channel programmable DMA controller 
handles the data transfers between the external main memory banks and the 
processing cell array. Figure 3 outlines the architecture overview. 

Fig. 3. The overall CSoC architecture with the mTC unit which includes the
heterogeneous processing resources, two clusters of reconfigurable processing cells with the
scratch-pad memories (local RAM) and the Plasma scalar processor.

In high performance systems high memory bandwidth is mandatory. Thus the
architecture uses a bandwidth hierarchy to bridge the gap between the deliverable
off-chip memory bandwidth and the bandwidth necessary for the computation
required by the application. The hierarchy has two levels: a main memory level
(External Memory Bank) for large, infrequently accessed data, realized as an
external memory block, and a local level for temporary use during calculation, realized
as high-speed dual-ported scratch-pad memories (local RAM).

4.2 Processing Resource Implementation 

Because the application field is digital signal, image and video processing,
the heterogeneous processing resources comprise

— two ALU (Arithmetic Logic Unit) blocks with four independent 48- or eight 
24-bit adder units, saturation logic and boolean function, 

— a multiplier block with two 24-bit signed/unsigned multipliers, 

— a 24-bit division and square root unit, 

— a MIPS-I compatible 32-bit Plasma scalar processor core, 

— a 256 * 24-bit lookup table with two independent Address Generation Units 
(AGU), 

— two register banks of 4 * 48-bit universal registers (R0-R3; R0'-R3').

The 32-bit Plasma scalar processor core can be used for general-purpose
processing or for tasks which cannot be accomplished well on the other processing
resources. It can directly address the local RAMs of the reconfigurable
processing cells. Bit computation is achievable with the ALU block. The lookup-table
and the address generation unit can be used to construct e.g. a Direct Digital 
Frequency Synthesis (DDFS) generator which is frequently used in telecommu- 
nication applications. 
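How a lookup table and an address generator combine into a DDFS generator can be sketched in software. The 256-entry table with 24-bit words follows the resource list above; the 32-bit phase accumulator, the sine contents of the table and all names are assumptions made for illustration, not details of the implemented hardware.

```python
import math

LUT_SIZE = 256        # from the text: 256-entry lookup table
WORD_BITS = 24        # from the text: 24-bit table entries
ACC_BITS = 32         # assumption: width of the phase accumulator
ADDR_BITS = 8         # log2(LUT_SIZE)

AMPLITUDE = (1 << (WORD_BITS - 1)) - 1
# One full sine period stored as signed 24-bit integers (assumed table contents).
SINE_LUT = [round(AMPLITUDE * math.sin(2 * math.pi * i / LUT_SIZE))
            for i in range(LUT_SIZE)]


def ddfs(tuning_word: int, n_samples: int) -> list[int]:
    """Phase-accumulator synthesis: the top address bits of the accumulator
    index the table; the tuning word sets the output frequency."""
    phase = 0
    samples = []
    for _ in range(n_samples):
        samples.append(SINE_LUT[phase >> (ACC_BITS - ADDR_BITS)])
        phase = (phase + tuning_word) & ((1 << ACC_BITS) - 1)
    return samples


# A tuning word of 2**24 advances the table address by one entry per sample.
one_period = ddfs(1 << 24, LUT_SIZE)
```

With this scheme the output frequency is the clock frequency multiplied by `tuning_word / 2**ACC_BITS`, which is the usual relation for a phase-accumulator synthesizer.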
4.3 System Control Mechanism and Instruction Set

The micro Task Controller is a global micro-programmed control machine which
offers flexible sharing of the architecture resources. It manages the
heterogeneous processing resources and controls the configuration manager without
stalling the whole system. The mTC executes a sequence of micro-instructions
which are located in a micro-program code memory. It has a micro-program
address counter which is simply incremented by one on each clock cycle to
select the address of the next micro-instruction. Additionally, the micro Task
Controller provides an instruction to initiate the configuration and control-flow
instructions for branching. The following four instructions are implemented:

— GOTO [Address]: Unconditional jump to a dedicated position in the
micro-program code.

— FTEST [Flag, Address]: Test a flag and branch if equal. If the flag is set
(e.g. saturation flag), jump to a dedicated position in the micro-program
code, otherwise continue.

— TEST [Value, Register, Address]: Test a 48/24-bit value and branch if equal.
The instruction is equivalent to the FTEST instruction, but it tests the content
of a universal register.

— CONFIG [Function Address]: Start address of the configuration context in
the descriptor memory. The configuration manager fetches the descriptors
depending on the descriptor type. The configuration process only needs to be
triggered. The configuration manager signals the termination of the
configuration process with a finish flag.

The instruction set allows control of the implemented heterogeneous processing
resources, the configuration manager and the data-streams. It allows conditional
reconfiguration and data dependent branching. If-then-else, while and select
constructs can be easily modeled.
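The control flow offered by the four instructions can be illustrated with a small software model. This is a sketch under stated assumptions, not the mTC implementation: the tuple encoding of the micro-program, the `ADD` and `HALT` pseudo-operations (stand-ins for data-path micro-operations and program end) and the instantly completing configuration are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MicroTaskController:
    program: list                                   # micro-program code memory
    regs: dict = field(default_factory=dict)        # universal registers
    flags: dict = field(default_factory=dict)       # status flags
    pc: int = 0                                     # micro-program address counter
    configured: list = field(default_factory=list)  # triggered configuration contexts

    def step(self) -> bool:
        """Execute one micro-instruction; return False when the program halts."""
        op, *args = self.program[self.pc]
        next_pc = self.pc + 1                       # default: increment by one
        if op == "GOTO":
            next_pc = args[0]
        elif op == "FTEST":                         # branch if the flag is set
            flag, address = args
            if self.flags.get(flag, False):
                next_pc = address
        elif op == "TEST":                          # branch if register equals value
            value, register, address = args
            if self.regs.get(register, 0) == value:
                next_pc = address
        elif op == "CONFIG":                        # only triggers the configuration
            self.configured.append(args[0])
            self.flags["finish"] = True             # assumption: completes at once
        elif op == "ADD":                           # hypothetical data-path operation
            register, amount = args
            self.regs[register] = self.regs.get(register, 0) + amount
        elif op == "HALT":                          # hypothetical end marker
            return False
        else:
            raise ValueError(f"unknown micro-instruction {op!r}")
        self.pc = next_pc
        return True

    def run(self, max_steps: int = 1000) -> int:
        """Run until HALT or until max_steps instructions have executed."""
        steps = 0
        while steps < max_steps and self.step():
            steps += 1
        return steps


# Wait for the configuration to finish, then run a while loop until R0 reaches 0.
program = [
    ("CONFIG", 0x10),          # 0: trigger configuration context at 0x10
    ("FTEST", "finish", 3),    # 1: leave the wait loop once the finish flag is set
    ("GOTO", 1),               # 2: otherwise keep polling
    ("TEST", 0, "R0", 6),      # 3: while R0 != 0 ...
    ("ADD", "R0", -1),         # 4:   loop body (placeholder)
    ("GOTO", 3),               # 5:   next iteration
    ("HALT",),                 # 6: done
]
mtc = MicroTaskController(program, regs={"R0": 3})
mtc.run()
```

The wait loop at addresses 1 and 2 corresponds to polling the finish flag, and addresses 3 to 5 form a while construct built from TEST and GOTO.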

5 Programming Paradigm 

An application is represented as a Control Data-Flow Graph (CDFG). In this 
graph, control (order of execution) and data are modeled in the same way. It 
conveys the potential concurrencies within an algorithm and facilitates the
parallelization and mapping to arbitrary architectures. The graph nodes usually
correspond to primitives, such as FFT, FIR-filter or complex multiplication (CMUL).
For a given network architecture, the programmer starts by partitioning and
mapping the graph into micro-tasks (subtasks), allocating the flow graph nodes
to the reconfigurable and heterogeneous processing resources in the system. It
specifies the program structure: the sequence of micro-tasks that comprises the 
application, the connection between them and the input and output ports for 
the data streams. A micro-task processes input data and produces output data. 
After a task finishes, its output is typically input to the next micro-task. Figure 
4 shows an example composed of several micro-tasks with data feedback.

Fig. 4. An example of a cascade of computation tasks with data feedback. A task may
consist of a configuration and a processing part.



After the partitioning and mapping, a scheduling process produces the
schedule for the application, which determines the order of the micro-task execution.
The schedule for a computation flow is stored in a microprogram as a sequence
of steps that directs the processing resources. It is represented as an mTC
reservation table with a number of usable reservation time slots. The allocation of
these time slots can be predefined by the scheduler process via a determined
programming sequence. The sequence consists of micro-instructions which can
be interpreted as a series of bit fields. A bit field is associated with exactly one
particular micro operation. When a field is executed, its value is interpreted to
provide the control for this function during that clock period.

Descriptor specification: Descriptors can be used separately or application
kernels can be chosen from a function library [8].

Plasma program: This is generated through the GNU C language compiler. The
compiler generates MIPS-I compatible code for the Plasma processor.

The programming paradigm makes it possible to expose the maximum parallelism
available in the application. Exploiting this parallelism allows
the high computation rates which are necessary to achieve high performance.
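The idea of a micro-instruction as a series of bit fields, each tied to exactly one micro operation, can be sketched as follows. The field names and widths are hypothetical; the text does not give the actual micro-instruction layout.

```python
# Hypothetical field layout: (name, width in bits), least significant field first.
FIELDS = [("alu_op", 4), ("mul_en", 1), ("agu_step", 3), ("branch_op", 2)]


def encode(**values: int) -> int:
    """Pack the given field values into one micro-instruction word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def decode(word: int) -> dict[str, int]:
    """Split a micro-instruction word back into its control fields."""
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields


word = encode(alu_op=5, mul_en=1, branch_op=2)
```

Each decoded field would drive one resource during the clock period in which the micro-instruction is active; a row of the reservation table then corresponds to one such word.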

6 Mapping Algorithms and Performance Analysis 

This section discusses the mapping of some applications onto the Configurable
System-on-Chip architecture. First, the Fast Fourier Transform (FFT) is mapped
because of its high importance and tight real-time constraints for
telecommunication and multimedia systems. Then other typical kernels are mapped and
some results are presented.

The Sande and Tukey Fast Fourier Transformation (FFT) algorithm with 
DIF (Decimation in Frequency) is chosen [9]. It can be more efficiently mapped 
on the CSoC hardware resources due to a better processing resource utilization. 
The butterfly calculation is the basic operation of the FFT. The implementation
uses a DIF-FFT radix-2 butterfly. It is shown in figure 5a). The radix-2 butterfly
takes a pair of input data values "X" and "Y" and produces a pair of outputs
"X'" and "Y'" where

$$X' = X + Y, \qquad Y' = (X - Y)\,W_N^k. \qquad (1)$$

In general the input samples as well as the twiddle factors are complex and can
be expressed as

$$X = X_r + jX_i, \qquad Y = Y_r + jY_i, \qquad (2)$$

$$W_N^k = e^{-j2\pi k/N} = \cos(2\pi k/N) - j\sin(2\pi k/N) \qquad (3)$$

where the suffix r indicates the real part and the i the imaginary part of the data. 
In order to execute a 24-bit radix-2 butterfly in parallel, a set of descriptors for
the RPCA and the mTC microcode for controlling the heterogeneous processing
resources and the configuration manager are necessary. The butterfly operation
requires four real multiplications and six real additions/subtractions. To map
the complex multiplication, two multiplier and two multiplier/add descriptors
are needed as configuration templates. It can be mapped in parallel onto four
processing cells of the RPCA. The complex adder and subtracter are mapped
onto the ALU block. The whole structure is shown in figure 5b). The pipelined
execution of the radix-2 butterfly takes five clock cycles. The execution is best
illustrated using the reservation table. It shows the hardware resource allocation
in every clock cycle. Figure 6 shows the reservation table for the first stage of
an n-point FFT fragmented into three phases. The initialize phase comprises
static configurations of the processing resources. It initiates the configuration
which maps four complex multipliers onto the RPCA by using the CONFIG
instruction. In the configuration phase, the configuration of the reconfigurable
array is accomplished. The process phase starts the computation of the n points.
It is controlled by the FTEST instruction. The twiddle factors $W_N^k$ are
generated for each stage via the lookup-table and the address generation units.
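The operation count of the butterfly can be made explicit with a short sketch that uses only real arithmetic. The grouping into two plain multiplications and two multiply/add operations is one plausible reading of the descriptor split described above and is an assumption; floating point is used here instead of the 24-bit fixed-point arithmetic of the hardware.

```python
import cmath


def dif_butterfly(xr, xi, yr, yi, wr, wi):
    """Radix-2 DIF butterfly on real and imaginary parts.

    Uses 4 real multiplications and 6 real additions/subtractions.
    """
    # Complex adder and subtracter (ALU block in the text): 4 real add/sub.
    sum_r, sum_i = xr + yr, xi + yi          # X' = X + Y
    dif_r, dif_i = xr - yr, xi - yi          # D  = X - Y
    # Complex multiplication D * W: 4 real multiplications, 2 real add/sub.
    p_rr = dif_r * wr                        # plain multiplier
    p_ri = dif_r * wi                        # plain multiplier
    out_r = p_rr - dif_i * wi                # multiplier/add
    out_i = p_ri + dif_i * wr                # multiplier/add
    return sum_r, sum_i, out_r, out_i


# Check against Python's complex arithmetic for one arbitrary input pair.
x, y = 1 + 2j, 3 - 1j
w = cmath.exp(-2j * cmath.pi * 1 / 8)        # twiddle factor with k = 1, N = 8
result = dif_butterfly(x.real, x.imag, y.real, y.imag, w.real, w.imag)
```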

The example in figure 7 shows the execution of the n-point radix-2 FFT. 
Each oval in the figure corresponds to the execution of an FFT compute stage
(micro-task), while each arrow represents a data stream transfer. The
example uses an in-place FFT implementation variant. It allows the results of each
FFT butterfly to replace its input. This makes efficient use of the dual-ported
scratch-pad memories as the transformed data overwrites the input data. The
reordering of the output data is simply arranged by reversing the address bits via the
scratch-pad memory data sequencer unit (bit-reverse addressing). For
demonstration, a 64-point FFT computation, a 16x16 Matrix-Vector-Multiplication
(MVM), a 32-tap systolic FIR-filter and a 32-tap symmetrical FIR-filter with
12-bit integer coefficients and data and a 32-tap real IIR filter implementation are
mapped. Generally, larger kernels can be calculated by time-division multiple
access (TDMA).
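The in-place variant with bit-reversed read-out can be sketched in a few lines. This is a plain software model of a radix-2 DIF FFT in floating point, not the mapped hardware implementation; the function names are illustrative.

```python
import cmath


def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` address bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_dif_in_place(data: list) -> list:
    """Radix-2 DIF FFT: every butterfly overwrites its own inputs, and the
    result is read out through bit-reversed addresses."""
    n = len(data)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    span = n // 2
    while span >= 1:                              # one pass per compute stage
        stride = n // (2 * span)                  # twiddle index stride in this stage
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k * stride / n)
                a, b = data[start + k], data[start + k + span]
                data[start + k] = a + b           # X' = X + Y
                data[start + k + span] = (a - b) * w   # Y' = (X - Y) W
        span //= 2
    return [data[bit_reverse(i, bits)] for i in range(n)]


def naive_dft(samples: list) -> list:
    """Direct O(n^2) DFT used only as a reference."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


signal = [complex(v) for v in range(8)]
spectrum = fft_dif_in_place(list(signal))
reference = naive_dft(signal)
```

Each pass of the outer loop corresponds to one compute stage (micro-task), and the final list comprehension plays the role of the bit-reverse addressing performed by the data sequencer unit.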

One key to achieving high performance is keeping each functional unit as busy
as possible. This goal is pursued by efficiently mapping applications to
the CSoC model, including how to partition the kernel structure or large
computations. An important aspect is the resource occupancy of the architecture, i.e. the
percentage of hardware resources which are involved in processing a kernel or
an application. The occupancy of the architecture resources for several kernels
is shown in figure 8 a). The hierarchical architecture composition allows
four complex radix-2 butterflies to be mapped in parallel onto the CSoC with a single
cluster. As shown, the complex butterfly can be efficiently implemented due to
the mapping of the complex adders to the heterogeneous processing resources
and the complex multipliers to the reconfigurable processing cells. The resource
unit occupancy of the cluster (Reconfigurable Resources) is 75%. The expensive
multipliers in the reconfigurable processing cells are completely used.

Fig. 5. The butterfly structure for the decimation in frequency radix-2 FFT in a). In
b) the mapping of the complex adders onto the ALU block of the heterogeneous
processing resources and of the complex multipliers onto the reconfigurable
processing array. The implementation uses additional pipeline registers (Pipe Register)
provided by the configuration manager (light hatched) for timing balance.

The 16x16 MVM kernel is mapped and calculated in parallel with 24-bit
precision. The partial data accumulation is done via the adder resources in the
ALU block. A 48-bit MVM calculation is not feasible due to limited broadcast
bus resources in the cluster [8]. This results in a relatively poor resource occupancy
as a 24-bit multiplier in the reconfigurable processing cells cannot be used.
The FIR-filter structures can be mapped directly onto the RPCA. They can be
calculated with 48-bit precision. As a result of insufficient adder resources in
a reconfigurable processing cell, the 32-tap symmetrical FIR-filter can only be
calculated with 24-bit precision. The resource occupancy in a reconfigurable
processing cell is poor as a 24-bit multiplier cannot be used [8]. The IIR-filter
is composed of two parallel 32-tap systolic FIR-filters and an adder from the
ALU block. It can only be calculated with full 48-bit precision by using TDMA
processing. The lack of wiring resources (broadcast buses in a cluster) limits the
resource load in a cluster, as the MVM and the 32-tap symmetrical FIR-filter
mappings illustrate. The wiring problem cannot always be completely hidden but
can be reduced, as shown in the case of the mapping of the complex butterfly.

Fig. 6. The simplified reservation table of the first FFT stage to process four
complex butterflies in parallel. The hardware resources are listed on the left-hand side.
First the time slots for the reconfigurable processing array (four parallel stripes) with
the pipe stages (Stage), then the heterogeneous processing resources and the mTC
control-instructions are illustrated. The hatched area marks the initialization phase.
The reconfigurable processing array is configured after the fourth cycle. The Memory
Data Sequencer Units (M-DSU) start after the configuration process (time slot 5) to
provide the input data. The heterogeneous and configurable processing resources are
statically configured.

Fig. 7. An n-point FFT (n = max. 1024) is split into several compute stages
(micro-tasks) which are performed sequentially. The compute stages are separated into several
parts in the reservation table.

Fig. 8. (a) Resource unit occupancy of selected kernels on the CSoC architecture with
a single cluster. The dark area shows the cluster utilization while the striped area
shows the functional unit utilization of the heterogeneous processing resources. (b)
Performance results for the applications using a single cluster. The results do not
include the configuration cycles. (c) Speedup results relative to a single cluster with
16 reconfigurable processing cells for selected kernels. (d) The CSoC ASIC
prototype layout without the pad ring.



Achieved performance results for some of the kernels above are summarized in 
figure 8 b). It shows the 24- and 48-bit realization of the MVM kernel. 

The 48-bit MVM computation needs twice as many clock cycles as the 24-bit
realization due to the broadcast bus bottleneck. The other kernels can be
calculated with 48-bit precision. The 32-tap FIR-filter can be executed on the
processing array by feeding back the partial data. The 64-point FFT uses four
complex butterflies in parallel. The 32-tap IIR-filter needs more than twice as
many clock cycles due to the need for TDMA processing to compute the
32-tap FIR-filters in parallel. Speedups with reconfigurable processing cell scaling
in four parallel stripes are shown in figure 8 c). The filter kernels have linear
speedups up to N = 24 because they can be directly mapped onto the RPCA. The
MVM computation has linear speedup up to N = 16. A further increase in the number of
processing cells gives a performance decrease due to the broadcast bus
bottleneck for a stripe. The FFT computation with the parallel implementation of the
butterflies shows a resource bottleneck in the reconfigurable processing cells for
N = 4 and N = 12. In these cases, it is not possible to map complete complex
multipliers in parallel onto the RPCA. In the case of N = 20, the broadcast bus
bottleneck and an ALU adder bottleneck decrease the performance rapidly due to
the absence of adder resources for the butterfly computation. The performance
degradation in the N = 24 case results from the ALU adder resource
bottleneck, although the complex multipliers can be mapped perfectly onto four parallel stripes
with 24 processing cells. Scaling the number of clusters results in a near-linear
speedup for most kernels unless a resource bottleneck in the heterogeneous
resources occurs.

7 Design and Physical Implementation 

The CSoC architecture has been implemented using a standard cell design
methodology for a UMC 0.18 micron, six metal layer CMOS (1.8 V) process
available through the European joint academic/industry project
EUROPRACTICE. The architecture was modeled using VHDL as hardware description
language. The RPCA and the mTC with the heterogeneous processing resources
were simulated separately with the VHDL System Simulator (VSS). They were
then synthesized using appropriate timing constraints with the Synopsys Design
Compiler (DC) and mapped using Cadence Silicon Ensemble. The first
implementation includes a single cluster with 16 processing cells and the mTC unit
with the heterogeneous processing resources. The Plasma scalar processor
includes a 1k * 32 program cache. No further caches are implemented. The final
layout, shown in figure 8 d), has a total area of 10 mm². A cluster with 16
processing cells needs 7.1 mm² silicon area, the Plasma processor core in 0.18
micron CMOS technology approximately 0.4 mm². It can be clocked up to 210
MHz. After a static timing analysis with Cadence Pearl, the CSoC design runs
at clock frequencies up to 140 MHz.

8 Conclusions and Future Work 

A Configurable System-on-Chip for embedded devices has been described in this 
paper. The CSoC introduces an architecture for telecommunication, audio, and 
video algorithms in order to meet high performance and flexibility demands. 
The architecture provides high computational density and the flexibility to change 
behavior at run-time. The system consists of two clusters of reconfigurable 
processing cells, heterogeneous processing resources and a programmable micro 
Task Controller unit with a small instruction set. The reconfigurable processing 
array has been designed for data-parallel and computation-intensive tasks. 
However, the CSoC architecture proposed here is different from many CSoC 
architectures in an important and fundamental way. It is hierarchically organized 
and offers a programming model. The architecture allows an efficient mapping 
of application kernels, as the FFT mapping has illustrated, and is scalable by 
adding more reconfigurable processing cells to a cluster, by increasing the 
number of clusters or by adding more processing units to the heterogeneous 
resources. A major advantage of the CSoC architecture approach is that it provides 
forward compatibility with other CSoC architecture families. They may offer a 
different number of clusters or reconfigurable processing cells to meet varying 
performance requirements. The micro-task program and the descriptor set can 
be reused. A prototype chip layout with a single cluster has been developed using 
a UMC 0.18 micron 6-layer CMOS process. 

Apart from tuning and evaluating the CSoC architecture for additional 
applications, there are several directions for future work. One key challenge is to 
extend the mTC instruction set in order to increase flexibility. Another 
challenge is to create an automatic partitioning and mapping tool to assist the user. 
The programmable micro Task Controller and the reconfigurable processing 
array with the descriptors as configuration templates can be a solid base for such 
a tool. 



Abstract. Previous work has shown that reconfigurable architectures 
are particularly well-adapted for implementing regular processing 
applications. Nevertheless, they are inefficient for designing complex 
control systems. To overcome this drawback, microprocessors are 
jointly used with reconfigurable devices. However, only regular, modular 
and reconfigurable architectures can easily take into account constant 
technology improvements, since they are based on the repetition of 
small units. This paper focuses on the self-adaptative features of a new 
reconfigurable architecture dedicated to the control from the application 
to the computation level. This reconfigurable device can itself adapt its 
resources to the application at run-time, and can exploit a high level of 
parallelism within an architecture called RAMPASS. 

Keywords: dynamic reconfiguration, adaptative reconfigurable archi- 
tecture, control parallelism 



1 Introduction 

The silicon area of reconfigurable devices is filled with a large number of 
computing primitives, interconnected via a configurable network. The functionality 
of each element can be programmed as well as the interconnect pattern. These 
regular and modular structures are adapted to exploit future microelectronic 
technology improvements. In fact, semiconductor road maps [1] indicate that 
the integration density of regular structures (like memories) increases faster than 
that of irregular ones (Tab. 1). In this introduction, existing reconfigurable 
architectures, as well as solutions to control these structures, are first presented. 
This allows us to highlight the benefits of our architecture dedicated to the 
control, which is then described in detail. 

Table 1. Integration density for future VLSI devices 

Year                      1999     2001     2003     2005     2009     2012 
Process (nm)              180      150      130      100      70       50 
DRAM (bit/chip)           1.07 G   1.7 G    4.29 G   17.2 G   68.7 G   275 G 
MPU (transistors/chip)    21 M     40 M     76 M     200 M    520 M    1.4 G 

A reconfigurable circuit can adapt its features, completely or partially, to 
applications during a process called reconfiguration. These reconfigurations are 
statically or dynamically managed by hardware mechanisms [2]. These 
architectures can efficiently perform hardware computations, while retaining much of 
the flexibility of a software solution [3]. Their resources can be arranged to 
implement specific and heterogeneous applications. Three reconfiguration 
levels can be distinguished: 

— gate level: FPGAs (Field Programmable Gate Arrays) are the best-known 
and most widely used gate-level reconfigurable architectures [4,5]. These devices 
merge three kinds of resources: the first one is an interconnection network, 
the second one is a set of processing blocks (LUT, registers, etc.) and the 
third one groups I/O blocks. The reconfiguration process consists of using 
the interconnection network to connect different reconfigurable processing 
elements. Furthermore, each LUT is configured to perform any logical 
operation on its inputs. These devices can exploit bit-level parallelism. 

— operator level: the reconfiguration takes place at the interconnection and 
the operator levels (PipeRench [6], DREAM [7], MorphoSys [8], REMARC 
[9], etc.). The main difference concerns the reconfiguration granularity, which 
is at the word level. The use of coarse-grain reconfigurable operators 
provides significant savings in time and area for word-based applications. They 
preserve a high level of flexibility in spite of the limitations imposed by the 
use of coarse-grain operators for better performance, which do not allow 
bit-level parallelism. 

— functional level: these architectures have been developed in order to 
implement intensive arithmetic computing applications (RaPiD [10], DART [11], 
Systolic Ring [12], etc.). These reconfigurable architectures are reconfigured 
by modifying the way their functional units are interconnected. The low 
reconfiguration data volume of these architectures makes it easier to 
implement dynamic reconfigurations and allows the definition of simple execution 
models. 

These architectures can have different levels of physical granularity, and 
whatever the reconfiguration grain is, partial reconfigurations are possible, allowing 
the virtualization of their resources. Thus, for instance, to increase performance 
(area, consumption, speed, etc.), an application based on arithmetic operators 
is optimally implemented on word-level reconfigurable architectures. 

Besides, according to Amdahl’s law [13], an application is always composed 
of regular and irregular processing. It is always possible to reduce and optimize 
the regular parts of an application by increasing the parallelism, but irregular 
code is irreducible. Moreover, it is difficult to map these irregular parts on 
reconfigurable architectures. Therefore, most reconfigurable systems need to be 
coupled with an external controller, especially for irregular processing or dynamic 
context switching. Performance depends directly on the position and the 
level of this coupling. Presently, four possibilities can be exploited by designers: 

— microprocessor: this solution is often chosen when reconfigurable units are 
used as coprocessors in a SoC (System on Chip) framework. The 
microprocessor can both execute its own processes (and irregular code) and configure 
its reconfigurable resources (PACT XPP-/Leon [14], PipeRench [15], 
MorphoSys [8], DREAM [7], etc.). These systems can execute critical processing 
on the microprocessor, while other concurrent processes can be executed on 
reconfigurable units. However, if it is used only to configure the resources and 
execute irregular code, this solution may be considered too expensive in terms of 
area and energy consumption, and would most likely be the bottleneck due to 
off-chip communication overheads in synchronization and instruction 
bandwidth. 

— processor core: this approach is completely different since the processor 
is mainly used as a reconfigurable unit controller. A processor is inserted 
near reconfigurable resources to configure them and to execute irregular 
processes. Performance can also be increased by exploiting control 
parallelism thanks to tight coupling (Matrix [16], Chimaera [17], NAPA [18], etc.). 
Only marginal improvements are often observed compared to a general-purpose 
microprocessor, but these solutions are well suited to controlling 
reconfigurable devices. 

— microsequencer: these control elements are only used to handle irregular 
processing or to configure resources. They can be found in the RaPiD 
architecture [10], for instance, as a small programmed controller with a short 
instruction set. Furthermore, the GARP architecture uses a processor 
only to load and execute array configurations [19]. A microsequencer is 
an optimal solution in terms of area and speed. Its features do not allow 
it to be considered as a coprocessor like the other solutions, but this 
approach is nevertheless the best fitted for specifically controlling reconfigurable 
units. However, control parallelism can hardly be exploited. 

— FPGA: this last solution consists in converting the control into a set of 
state machines, which can then be mapped to an FPGA. This approach 
can take advantage of traditional synthesis techniques for optimizing control. 
However, FPGAs are not optimized for implementing FSMs (Finite State 
Machines) because whole graphs of the application must be implemented even if 
non-deterministic processes occur. Indeed, these devices can hardly manage 
dynamic reconfigurations at the state level. 
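
The Amdahl's-law argument that motivates these controller options can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the regular fraction of 0.9 used in the demonstration are assumptions chosen for the example, not figures from this paper.

```python
def amdahl_speedup(regular_fraction: float, parallelism: float) -> float:
    """Overall speedup when only the regular fraction is accelerated.

    regular_fraction: share of execution time spent in regular processing (0..1).
    parallelism: speedup factor applied to that regular share (>= 1).
    """
    if not 0.0 <= regular_fraction <= 1.0:
        raise ValueError("regular_fraction must lie in [0, 1]")
    if parallelism < 1.0:
        raise ValueError("parallelism must be at least 1")
    irregular_fraction = 1.0 - regular_fraction
    return 1.0 / (irregular_fraction + regular_fraction / parallelism)


def speedup_limit(regular_fraction: float) -> float:
    """Upper bound on the speedup as parallelism grows without bound."""
    if regular_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - regular_fraction)


if __name__ == "__main__":
    # Hypothetical application: 90% regular, 10% irregular processing.
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.9, n), 2))
    print("limit", round(speedup_limit(0.9), 2))
```

However much parallelism is applied to the regular part, the speedup stays below the bound set by the irregular part, which is why the way irregular code is controlled matters.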



Reconfigurable devices are often used with a processor for non-deterministic 
processes. To minimize control and configuration overheads, the best solution 
consists in tightly coupling a processor core with the reconfigurable 
architecture [20]. However, designing for such systems is similar to a HW/SW co-design 
problem. In addition, the use of reconfigurable devices can be better adapted 
to deep sub-micron technological improvements. Nonetheless, the 
controller needs physical implementation features different from those of operators, 
and an FPGA cannot always be an optimal solution for computation. Indeed, the 
control handles small data and requires global communications to control all the 
processing elements, whereas the computation processes large data and uses 
local communications between operators. 

To deal with control for reconfigurable architectures, we have developed the 
RAMPASS architecture (Reconfigurable and Advanced Multi-Processing 
Architecture for future Silicon Systems) [21]. It is composed of two reconfigurable 
resources. The first one is suitable for computation purposes but is not a topic 
of interest for this paper. The second part of our architecture is dedicated to 
control processes. It is a self-reconfigurable and asynchronous architecture, which 
supports SIMD (Single Instruction Multiple Data), MIMD (Multiple Instruction 
Multiple Data) and multi-threading processes. 

This paper presents the mechanisms used to auto-adapt resource allocations 
to the application in the control part of RAMPASS. The paper is structured 
as follows: section 2 outlines a functional description of RAMPASS. Section 3 
presents a detailed functional description of the part of RAMPASS dedicated 
to the control. This presentation focuses on some concepts presented in [21]. 
Then, section 4 depicts auto-adaptative reconfiguration mechanisms of this con- 
trol part. Finally, section 5 presents the development flow, some results and deals 
with the SystemC model of our architecture. 



2 Functional Description of RAMPASS 

In this section, the global functionality of RAMPASS is described. It is composed 
of two main reconfigurable parts (Fig. 1): 

— One dedicated to the control of applications (RAC: Reconfigurable Adapted 
to the Control); 

— One dedicated to the computation (RAO: Reconfigurable Adapted to Oper- 
ators). 

Even if the RAC is a part of RAMPASS, it can be dissociated from it and integrated 
with other architectures of any computational grain. Each 
computation block can be either a general-purpose processor or a functional unit. The 
RAC is a generic control architecture and is the main interest of this paper. In 
this article, the RAO can be considered as a computational device adapted to 
the application, with a specific interface in order to communicate with the RAC. 
This interface must support instructions from the RAC and return one-bit flags 
according to its processes. 

Fig. 1. Organization of RAMPASS 



2.1 Overview 

From a C description, any application can be translated as a CDFG (Control 
Data Flow Graph), which is a CFG (Control Flow Graph) with the instructions 
of the basic blocks expressed as a DFG (Data Flow Graph). Thus, their partition 
is easily conceivable [22,23]. 

A CFG or a State Graph (SG) represents the control relationships between 
the set of basic blocks. Each basic block contains a set of deterministic instruc- 
tions, called actions. Thus, every state in a SG is linked to an action. Besides, 
every arc in a SG either connects a state to a transition, or a transition to a 
state. A SG executes by firing transitions. When a transition fires, one token 
is removed from each input state of the transition and one token is added to 
each output state of the transition. These transitions determine the appropri- 
ate control edge to follow. On the other hand, a DFG represents the overall 
corresponding method compiled onto hardware. 
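
The firing rule described above can be illustrated with a minimal sketch. It only models the token semantics stated in the text; the class and field names, and the choice to fire the transitions validated by one event in list order, are assumptions made for this example and do not describe the RAC hardware.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    event: str        # event that validates the transition
    inputs: tuple     # names of the input states
    outputs: tuple    # names of the output states


@dataclass
class StateGraph:
    transitions: list
    tokens: dict = field(default_factory=dict)   # state name -> token count

    def enabled(self, t: Transition) -> bool:
        # A transition may fire only if every input state holds a token.
        return all(self.tokens.get(s, 0) > 0 for s in t.inputs)

    def fire(self, event: str) -> list:
        """Fire the enabled transitions validated by `event`; return them."""
        fired = []
        for t in self.transitions:
            if t.event != event or not self.enabled(t):
                continue
            for s in t.inputs:       # remove one token from each input state
                self.tokens[s] -= 1
            for s in t.outputs:      # add one token to each output state
                self.tokens[s] = self.tokens.get(s, 0) + 1
            fired.append(t)
        return fired

    def marked(self) -> set:
        """States that currently hold at least one token."""
        return {s for s, n in self.tokens.items() if n > 0}


if __name__ == "__main__":
    # S0 forks into S1 and S2 (AND divergence), which join into S3.
    sg = StateGraph(
        transitions=[
            Transition("e1", ("S0",), ("S1", "S2")),
            Transition("e2", ("S1", "S2"), ("S3",)),
        ],
        tokens={"S0": 1},
    )
    sg.fire("e1")
    print(sorted(sg.marked()))
```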

Consequently, whatever the application is, it can be composed of two different 
parts (Fig. 2). The first one computes operations (DFG) and the second one 
schedules these executions on a limited amount of processing resources (CFG). 




Fig. 2. Partitioning of an application (a) in Control/Computation (b) 



The first block of our architecture can physically store any application 
described as a CFG. States drive the computation elements in the RAO, and events 
coming from the RAO validate transitions in the SG. Moreover, self-routing 
mechanisms have been introduced in the RAC block to simplify SG mapping. 
The RAC can auto-implement a SG according to its free resources. The RAC 
controls connections between cells and manages its resources. All these 
mechanisms will be discussed in the following sections. 



2.2 Mapping and Running an Application with RAMPASS 

In this part, the configuration and the execution of an application in RAMPASS 
are described. Applications are stored in an external memory. As soon as the SG 
begins to be loaded in the RAC, its execution begins. In fact, the configuration 
and the execution are performed simultaneously. Contrary to a microprocessor, 
this has the advantage of never blocking the execution of applications, since the 
actions to be executed next are always mapped in the RAC. 

The reconfiguration of the RAC is self-managed and depends on the applica- 
tion progress. This concept is called auto-adaptative. The RAC Net has a limited 
number of cells, which must be dynamically used in order to map larger applica- 
tions. Indeed, due to a lack of resources, whole SGs can not always be mapped in 
the RAC. Dynamic reconfiguration has been introduced to increase the virtual 
size of the architecture. In our approach, no pre-divided contexts are required. 
Sub-blocks implemented in the RAC Net are continuously updated without any 
user help. Figure 3 shows a sub-graph of a 7-state application implemented at 
run-time in a 3-cell RAC according to the position of the token. 
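
As a rough illustration of how a small RAC can hold a moving part of a larger SG, the sketch below computes which states of a linear chain would be mapped for each token position. It is a deliberately simplified model under stated assumptions: the SG is a plain chain, the cell holding the token is kept, the remaining cells hold the states that follow it, and cells behind the token are released at once. The function names are hypothetical, and the real release behaviour depends on the accessibility and stop point mechanisms described later.

```python
def mapped_window(num_states: int, num_cells: int, token: int) -> list:
    """States of a linear SG held in the RAC when the token is at `token`.

    The chain is 0 -> 1 -> ... -> num_states - 1. The window starts at the
    state holding the token and extends over the following states until the
    cells are exhausted or the chain ends.
    """
    if num_cells < 2:
        raise ValueError("a chain needs at least two cells to make progress")
    if not 0 <= token < num_states:
        raise ValueError("token position out of range")
    last = min(num_states, token + num_cells)
    return list(range(token, last))


def trace(num_states: int, num_cells: int) -> list:
    """Window contents for every token position, from first to last state."""
    return [mapped_window(num_states, num_cells, t) for t in range(num_states)]


if __name__ == "__main__":
    # 7-state application in a 3-cell RAC, as in the example of the text.
    for token, window in enumerate(trace(7, 3)):
        print("token at", token, "-> mapped states", window)
```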





Fig. 3. Evolution of an implemented SG in the RAC Net (legend: sub-graph 
implemented in the RAC at run-time) 



Each time a token is received in a cell of a SG implemented in the RAC, 
its associated instructions are sent to the RAO. When the RAO has finished its 
processes, it returns an event to the cell. This event corresponds to an edge in 
the SG mapped in the RAC. These transitions permit the propagation of tokens 
in SGs. Besides, each block has its own synchronization mechanisms. In this globally 
asynchronous architecture, blocks are synchronized by 2-phase protocols [24]. 
It is possible to execute concurrently any parallel branches of a SG, or any 
independent SGs in the RAC. This ensures SIMD, MIMD, and multi-threading 
control parallelism. Besides, semaphores and mutexes can be directly mapped 
inside the RAC in order to manage shared resources or synchronization between 
SGs. Even if SGs are implemented cell by cell, their instantiations are concurrent. 



3 Functional Description of the Control Block: The RAC 

As previously mentioned, the RAC is a reconfigurable block dedicated to the 
control of an application. It is composed of five units (Fig. 4). The CPL 
(Configuration Protocol Layer), the CAM (Content Addressable Memory) and the 
LeafFinder are used to configure the RAC Net and to load the Instruction Memory. 

3.1 Overview 

The RAC Net can support physical implementation of SGs. When a cell is 
configured in the RAC Net, its associated instructions are stored in the Instruction 
Memory and the address of its description is stored in the CAM. Descriptions of 
cells are placed in a central memory and each description contains the instruction 
of the associated cell and the configuration of the cells that must be connected 
to it (daughter cells). In order to extend SGs in the RAC Net, the last cells of SGs, 
which are called leaf cells, are identified in the LeafFinder. These cells allow the 
extension of SGs. When a leaf cell is detected, a signal is sent to the CAM and 
the description of this cell is read in the central memory. From this description, 
the daughter cells of this leaf cell are configured and links are established 
between the cells in the RAC Net. The CAM can also find a cell mapped in the 
RAC Net thanks to its address. This is necessary if loop kernels try to connect 
already mapped cells. Finally, the propagation of tokens through SGs, thanks 
to events from the RAO, schedules the execution of instructions stored in the 
Instruction Memory. In the next section, the details of each block are given. 



Fig. 4. The RAC block (legend: units used for configuration, units used for 
execution) 



3.2 Blocks Description 

RAC Net. This element is composed of cells and interconnect components. 
SGs are physically implemented thanks to these resources. One state of a SG is 
implemented by one cell. Each cell directly drives instructions, which are sent 
to the RAO. The RAC Net is dynamically reconfigurable. Its resources can be 
released or used at run-time according to the execution of the application. 
Moreover, configuration and execution of SGs are fully concurrent. The RAC Net 
owns primitives to ensure the auto-routing and the management of its resources 
(cf. §4.1). The RAC Net is composed of three one-hot asynchronous FSMs (5, 8 
and 2 states) to ensure the propagation of tokens, their dynamic destruction and 
the creation of connections. It represents about one thousand transistors in ST 
0.18 µm technology. 



Instruction Memory. The Instruction Memory contains the instructions, 
which are sent by the RAC Net to the RAO when tokens run through SGs. 
An instruction can be either a configuration or a context address. As 
shown in figure 5, the split instruction bus allows the support of EPIC 
(Explicitly Parallel Instruction Computing) and the different kinds of parallelism 
introduced in the first section. Each column is reserved for a computation block 
in the RAO. For instance, the instructions A and B could be sent together to 
different computational blocks mapped in the RAO without creating conflicts, 
whereas the instruction C would be sent alone. A selection bit is also used to 
minimize energy consumption by disabling unused blocks. 

Furthermore, each line is separately driven by a state, i.e. each cell of the 
RAC Net is dedicated to the management of one line of this memory. This 
memory does not require address decoding since its access is directly done through 
its word lines. We call this kind of memory a word-line memory. 
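
The organization of this memory can be sketched in software as a table with one row per cell and one column per RAO computation block. This is only an illustrative model: the class name, the use of `None` to stand for a cleared selection bit, and the explicit bus-conflict check are assumptions of the example, not details given in the paper.

```python
class WordLineMemory:
    """Instruction memory whose rows are selected directly by RAC Net cells."""

    def __init__(self, num_cells: int, num_blocks: int):
        # One row per cell, one column per RAO computation block.
        # None models a cleared selection bit: the block stays disabled.
        self.rows = [[None] * num_blocks for _ in range(num_cells)]

    def store(self, cell: int, block: int, instruction: str) -> None:
        """Write the instruction that `cell` sends to computation block `block`."""
        self.rows[cell][block] = instruction

    def issue(self, active_cells: list) -> list:
        """Drive the word lines of the cells that currently hold a token.

        Returns, for each column, the instruction sent to that block, or None
        if the block is left disabled. No address is decoded: the row index is
        the cell itself.
        """
        bus = [None] * len(self.rows[0])
        for cell in active_cells:
            for block, instruction in enumerate(self.rows[cell]):
                if instruction is None:
                    continue
                if bus[block] is not None:
                    raise RuntimeError(f"bus conflict on block {block}")
                bus[block] = instruction
        return bus


if __name__ == "__main__":
    memory = WordLineMemory(num_cells=2, num_blocks=2)
    memory.store(0, 0, "A")   # A and B share a row but use different columns
    memory.store(0, 1, "B")
    memory.store(1, 0, "C")   # C is sent alone
    print(memory.issue([0]))
    print(memory.issue([1]))
```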



CPL. This unit manages SG implementation in the RAC Net. It sends all the 
useful information to connect cells, which can auto-route themselves. It drives 
either a new connection if the next state is not mapped in the RAC Net, or 
a connection between two states already mapped. It also sends primitives to 
release resources when the RAC Net is full. 



Fig. 5. Relation RAC Net/Instruction Memory 



CAM. This memory links each cell of the RAC Net used to map a state of a 
SG with its address in the external memory. Again, it can be driven directly 
through its word lines. It is used by the CPL to check if a cell is already mapped 
in the RAC Net. The CAM can select a cell in the RAC Net when its address 
is presented by the CPL at its input. Besides, the CAM contains the size of cell 
descriptions to optimize the bandwidth with the central memory. 
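
The role of the CAM can be summarized by the following sketch, which keeps one entry per cell and answers the question "is the state stored at this address already mapped, and in which cell?". The class and method names are hypothetical, and the sequential loop merely stands in for the parallel comparison a hardware CAM would perform.

```python
class CellCam:
    """Content-addressable table with one entry per RAC Net cell."""

    def __init__(self, num_cells: int):
        # Entry: (address of the state description, description size) or None.
        self.entries = [None] * num_cells

    def write(self, cell: int, address: int, size: int) -> None:
        """Record that `cell` maps the state described at `address`."""
        self.entries[cell] = (address, size)

    def clear(self, cell: int) -> None:
        """Forget the entry of a released cell."""
        self.entries[cell] = None

    def find_cell(self, address: int):
        """Return the cell mapping `address`, or None if it is not mapped."""
        for cell, entry in enumerate(self.entries):
            if entry is not None and entry[0] == address:
                return cell
        return None

    def description(self, cell: int):
        """Address and size needed to fetch the description of `cell`."""
        return self.entries[cell]


if __name__ == "__main__":
    cam = CellCam(num_cells=4)
    cam.write(cell=2, address=0x40, size=3)
    print(cam.find_cell(0x40), cam.find_cell(0x80))
```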



LeafFinder. This word-line memory identifies all the leaf cells. Leaf cells are 
in a semi-mapped state which does not yet have an associated instruction. The 
search is performed by a logic ring, which runs each time a leaf cell appears. 



4 Auto-adaptative Reconfiguration Control 

The first part of this section deals with the creation of connections between cells 
and their configuration. A cell that takes part in a SG must be configured in 
a special state corresponding to its function in the SG. The second part 
focuses on the release of already used cells. 



4.1 Graph Creation and Configuration 

New connection. To realize a new connection, i.e. a connection with a free 
cell, the CPL sends a primitive called connection. This automatically carries out 
a connection between an existing cell (the source cell), which is driven by the 
LeafFinder, and a new cell (the target cell), which is a free cell chosen in the 
neighborhood of the source cell. Thus, each daughter in the neighborhood of the 
source cell is successively tested until a free cell is found. The RAC Net and its 
network can self-manage these connections. In fact, carrying out a connection 
consists of validating existing physical connections between both cells. Finally, 
the path between the two cells can be considered as auto-routed in the RAC 
Net. 

Connection between two existing cells. When the RAC has found the two 
cells that must be connected, two primitives called preparation and search 
are successively sent by the CPL to the RAC Net. The first one initializes the 
search process and the second one executes it. The source cell is driven by 
the LeafFinder via the signal start, and the target cell by the CAM via the 
signal finish. According to the application, the network of the RAC Net can 
be either fully or partially interconnected. Indeed, the interconnection network 
area is a function of the square of the number of cells in the RAC Net. Thus, 
a fully connected network should be used only in highly irregular computing 
applications. 

If the network is fully interconnected, the connection is simply done by the 
interconnect, which receives both the signals start and finish. On the other hand, 
if cells are partially interconnected, handshaking mechanisms allow the source 
cell to find the target. Two signals called find and found link the cells together 
(Fig. 6). On the reception of the signal search, the source cell sends a find signal 
to its daughters. A free cell receiving this signal sends it again to its daughters 
(this signal can be received only once). So, the signal find spreads through 
free cells until it reaches the target cell. Then this cell sends back the signal 
found via the same path to the source cell. Finally, the path is validated and a 
hardware connection is established between the two cells. The intermediate and 
free cells, which take part in the connection, are in a special mode named bypass. 
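
The find/found handshake in a partially interconnected network behaves like a flood search through free cells. The sketch below models it with a breadth-first traversal; this is a modelling choice made for the example, since the actual mechanism is asynchronous and the first find signal to reach a cell wins. The function names and the dictionary-based description of the network are hypothetical.

```python
from collections import deque


def route(daughters: dict, free: set, source, target):
    """Return the path source -> ... -> target, or None if the search fails.

    daughters: cell -> list of physically linked daughter cells.
    free: cells currently unused; only they may forward the find signal.
    """
    came_from = {source: None}          # a cell accepts `find` only once
    frontier = deque([source])
    while frontier:
        cell = frontier.popleft()
        for nxt in daughters.get(cell, []):
            if nxt in came_from:
                continue
            if nxt == target:
                came_from[nxt] = cell
                # `found` travels back to the source along the recorded path.
                path = [target]
                while path[-1] != source:
                    path.append(came_from[path[-1]])
                return path[::-1]
            if nxt in free:             # free cells re-emit `find`
                came_from[nxt] = cell
                frontier.append(nxt)
    return None


def bypass_cells(path: list) -> list:
    """Intermediate cells of a validated path, which switch to bypass mode."""
    return path[1:-1]


if __name__ == "__main__":
    net = {"S": ["a", "u"], "a": ["b"], "b": ["T"], "u": ["T"]}
    # Cell "u" is already used, so the path must go through free cells a and b.
    path = route(net, free={"a", "b"}, source="S", target="T")
    print(path, bypass_cells(path))
```

When no chain of free cells reaches the target, the function returns `None`, which corresponds to the connection error handled by the release mechanisms.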



Configuration. The dynamic management of cells is done by a signal called 
accessibility. This signal links every cell of a SG when a connection is done. Each 
cell owns an Up Accessibility (UA) (from its mother cells) and a Down 
Accessibility (DA) (distributed to its daughter cells). At the time of a new connection, 
a cell receives the UA from its mother cells and stores its configuration coming 
from the CPL. In the case of multiple convergences (details on SG topologies 
have been presented in [21]), it receives the UA as soon as the first connection 
is established. Then, a configured cell is ready to receive and to give a token. 
After its configuration, the cell transmits its accessibility to its daughters. 



Fig. 6. Connection between the source cell (S) and its target cell (T) (panels: 
Preparation, Search; legend: used cell, free cell, find signal, found signal) 

When the connection has succeeded, the RAC Net notifies the CPL. 
Consequently, the CPL updates the CAM with the address of the newly mapped state, 
the LeafFinder defines the new cell as a leaf cell, and the Instruction Memory 
stores the correct instructions. 

When a connection fails, the RAC Net indicates an error to the CPL. The 
CPL deallocates resources in the RAC Net and searches for the next leaf cell with 
the LeafFinder. These two operations are repeated until a connection succeeds. 
Release mechanisms are detailed in the next section. 



4.2 Graph Release 

A cell stops delivering its accessibility when it no longer receives a UA and does 
not own a token. When a cell loses its accessibility, all its daughter cells are 
successively freed and can be used for other SG implementations. In order to 
prevent the release of frequently used cells, which may happen in loop kernels, 
a configuration signal called stop point can be used. 
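
One way to picture the accessibility rule is as a reachability computation: cells that hold a token, and the cells downstream of them, keep their configuration, while the rest can be freed. The sketch below follows that reading and additionally treats stop point cells as sources of accessibility. That treatment, like the function name and the graph encoding, is an assumption made for illustration; the paper does not spell out the exact signal-level behaviour.

```python
def configured_cells(daughters: dict, tokens: set, stop_points: set) -> set:
    """Cells that keep their configuration; all other cells can be freed.

    daughters: cell -> list of daughter cells in the mapped SG.
    tokens: cells currently holding a token.
    stop_points: cells configured as stop points (assumed to stay mapped).
    """
    keep = set(tokens) | set(stop_points)
    stack = list(keep)
    while stack:
        cell = stack.pop()
        for daughter in daughters.get(cell, []):
            if daughter not in keep:    # accessibility is passed downwards
                keep.add(daughter)
                stack.append(daughter)
    return keep


if __name__ == "__main__":
    chain = {0: [1], 1: [2], 2: [3], 3: [4]}
    mapped = {0, 1, 2, 3, 4}
    kept = configured_cells(chain, tokens={2}, stop_points=set())
    print("kept", sorted(kept), "freed", sorted(mapped - kept))
```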

Due to resource limitations, a connection attempt may fail. For this reason, 
a complete error management system has been developed. It is composed of 
three primitives, which release more or fewer cells. The frequency 
of connection errors is evaluated by the CPL. When predefined thresholds are 
reached, adapted primitives are sent to the RAC Net to free unused resources. 
The first one is called test access. It can free a cell in stop point mode (Fig. 7). 
Every cell between two stop point cells is freed. Indeed, a stop point cell is freed 
on a rising edge of the test access signal when it receives the accessibility from 
its mothers. 




Fig. 7. Releasing of cells with test access 



The second release primitive is named reset stop point. It can force the 
release of any stop point cells when they do not have any token. This mode keeps 
the cells involved in the implementation of loop kernels and reduces the release. In 
some critical cases (when resources are very limited), it can lead to an idle state. 

Finally, the last primitive, called reset idle state, guarantees no idle state in 
the RAC. This is done by freeing all the cells that do not own a token. This 
solution is of course the most effective but is very expensive in time and energy 
consumption. It must only be used in case of repeated deallocation errors. 
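
The escalation between the three release primitives can be sketched as a small policy object. The paper only states that predefined thresholds on the frequency of connection errors select the primitive; counting consecutive failures, resetting the count on success, and the threshold values used here are all assumptions made for this example.

```python
RELEASE_PRIMITIVES = ("test_access", "reset_stop_point", "reset_idle_state")


class ReleasePolicy:
    """Chooses a release primitive from the number of recent connection errors."""

    def __init__(self, thresholds=(1, 4, 8)):
        # Hypothetical thresholds: error counts from which each primitive is used.
        if len(thresholds) != 3 or list(thresholds) != sorted(thresholds):
            raise ValueError("need three non-decreasing thresholds")
        self.thresholds = tuple(thresholds)
        self.errors = 0

    def connection_succeeded(self) -> None:
        """A successful connection clears the error history."""
        self.errors = 0

    def connection_failed(self):
        """Record a failure and return the primitive to send, or None."""
        self.errors += 1
        chosen = None
        for limit, primitive in zip(self.thresholds, RELEASE_PRIMITIVES):
            if self.errors >= limit:
                chosen = primitive      # keep the strongest primitive reached
        return chosen


if __name__ == "__main__":
    policy = ReleasePolicy(thresholds=(1, 3, 5))
    for attempt in range(5):
        print(attempt + 1, policy.connection_failed())
```

The ordering reflects the cost argument of the text: the cheapest primitive is tried first, and the most expensive one is reserved for repeated failures.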

No heuristics decide how many cells must be reclaimed or loaded. This is 
done automatically even if the deallocation is not optimal. That is why stop 
point cells must be adequately placed in SGs to limit releases. 

Non-deterministic algorithms need to make decisions to follow their 
processes. This can be translated as OR divergences, i.e. events determine which 
branch will be followed by firing transitions. To avoid speculative 
construction and the configuration of too many unused cells, the construction of SGs is 
blocked until correct decisions are taken. This does not slow the execution of the 
application since the RAC Net always contains the next processes. Moreover, 
we consider that execution is slower than reconfiguration, and that an optimal 
computation time is about 3 ns. Indeed, we estimate the reconfiguration time of 
a cell at 7.5 ns and the minimum time between two successive instructions 
for a fully interconnected network at 3 ns + 1.5 ns, where 1.5 ns is the interconnect 
propagation time for a 32-cell RAC. 

5 Implementation and Performance Estimation 

An architecture cannot be exploited without a development flow. For this 
reason, a development flow is currently a major research concern of our laboratory 
(Fig. 8). From a description of the application in C, an intermediate 
representation can be obtained by a front-end like SUIF [22,23]. Then, a 
parallelism exploration from the CDFG must be done to assign tasks to the multiple 
computing resources of the RAO. This parallelism exploration under constraints 
increases performance and minimizes the energy consumption and the memory 
bandwidth. The allocation of multiple resources in the RAO can also increase 
the level of parallelism. From this optimized CDFG, DFGs must be extracted in 
order to be executed on RAO resources. Each DFG is then translated into RAO 
configurations by a behavioral synthesis scheme. This function is currently 
under development through the OSGAR project, which consists in designing 
a general-purpose synthesizer for any reconfigurable architecture. This RNTL 
project, under the supervision of the French research ministry, associates TNI-Valiosys, 
the Occidental Brittany University and the R2D2 team of the IRISA. On the 
other hand, a parser used to translate a CFG into the RAMPASS description 
language has been successfully developed. 



Fig. 8. RAMPASS Development Flow Graph 

Besides, a functional model of the RAC block has been designed with 
SystemC. Our functional-level description of the RAC is a CABA (Cycle Accurate 
and Bit Accurate) hardware model. It permits the change of the size and the 
features of the RAC Net and allows the evaluation of its energy consumption. 
SystemC easily allows hardware descriptions: it has the flexibility of the C++ 
language and brings all the primitives for 
modeling hardware architectures [25,26]. 

A lot of different programming structures have been implemented in the 
RAC block, e.g. exclusion mechanisms, AND convergence and divergence, syn- 
chronizations between separated graphs, etc. Moreover, an application of video 
processing (spinal search algorithm for motion estimation [27]) has been mapped 
(Fig. 9). The latency overhead is insignificant without reconfiguration when the
RAC owns 32 cells, or with a 15-cell RAC when the whole main loop kernel
can be implemented (0.01%), even if we cannot predict reconfigurations. Finally,
with a 7-cell RAC (the minimum required for this application), the overhead
rises to only 10% in spite of multiple reconfigurations, since the implementation
of the SG must be continuously updated.

Event table:
 0 - '1'
10 - Angle MB
11 - Side MB
12 - Core MB
20 - End RAO Config
30 - End Memory loading
40 - SAD <= Min
41 - SAD > Min
42 - SAD <= Min
43 - SAD > Threshold
50 - NumMB > NbmaxMB
51 - NumMB > NbmaxMB

Fig. 9. Motion estimation graph

Besides, hardware simulations have shown the benefits of release primitives. 
Indeed, the more cells are released, the more the energy consumption increases 
since they will have to be re-generated, especially in case of loops. Simulations 
have shown that these releases are done only when necessary. 

Some SG structures implemented in the RAC Net need a mandatory minimum number
of cells. This constrains the minimal number of cells needed to prevent deadlocks. For
instance, a multiple AND divergence has to be entirely mapped before the token
is transmitted. Consequently, an 8-state AND divergence needs at least nine
cells to work. Dynamic reconfiguration ensures the progress of SGs but cannot
prevent deadlocks if complex structures need more cells than are available inside
the RAC Net. Conversely, the user can map a linear SG of thousands of
cells with only two free cells.

6 Conclusion and Future Work 

A new paradigm of dynamically self-reconfigurable architecture has been proposed
in this paper. The part depicted is dedicated to the control and can physically
implement control graphs of applications. This architecture brings a novel
approach for controlling reconfigurable resources. It can accommodate future technology
improvements, allow a high level of parallelism and keep a constant execution
flow, even for non-predictable processing.

Our hardware simulation model has successfully validated static and dynamic
reconfiguration paradigms. Based on these results, further work will
be performed. To evaluate the performance of RAMPASS, a synthesized model and
a prototype of the RAC block are currently being designed in an ST 0.18 µm technology.

Moreover, the coupling between the RAC and other reconfigurable architectures
(DART, Systolic Ring, etc.) will be studied. The aim of these further
collaborations is to demonstrate the ability of the RAC to adapt
itself to different computation architectures.
Abstract. The on-chip memory performance of embedded systems directly
affects the system designers’ decision about how to allocate expensive silicon 
area. We investigate a novel memory architecture, flexible sequential and 
random access memory (FSRAM), for embedded systems. To realize sequential 
accesses, small “links” are added to each row in the RAM array to point to the 
next row to be prefetched. The potential cache pollution is ameliorated by a 
small sequential access buffer (SAB). To evaluate the architecture-level 
performance of FSRAM, we run the Mediabench benchmark programs [1] on a
modified version of the SimpleScalar simulator [2]. Our results show that the
FSRAM improves the performance of a baseline processor with a 16KB data
cache by up to 55%, with an average of 9%. We also designed RTL and SPICE
models of the FSRAM [3], which show that the FSRAM significantly improves 
memory access time, while reducing power consumption, with negligible area 
overhead. 



1 Introduction 

Rapid advances in high-performance computing architectures and semiconductor 
technologies have drawn considerable interest to high performance memories. 
Increases in hardware capabilities have led to performance bottlenecks due to the time 
required to access the memory. Furthermore, the on-chip memory performance in 
embedded systems directly affects designers’ decisions about how to allocate 
expensive silicon area. Off-chip memory power consumption has become the energy 
consumption bottleneck as embedded applications become more data-centric. 

Most of the recent research has tended to focus on improving performance and 
power consumption of on-chip memory structures [4, 5, 6] rather than off-chip 
memory. Moon et al. [7] investigated a low-power sequential access on-chip memory
designed to exploit the numerous sequential access patterns in digital signal 
processing (DSP) applications. Prefetching techniques from traditional computer 
architecture have also been used to enhance on-chip memory performance for 
embedded systems [8, 9, 10]. Other studies have investigated energy efficient off-chip 
memory for embedded systems, such as automatic data migration for multi-bank 
memory systems [11]. 
None of these previous studies, however, have investigated using off-chip
memory structures to improve on-chip memory performance. This study demonstrates
the performance potential of a novel, low-power, off-chip memory structure, which
we call the flexible sequential and random access memory (FSRAM), to support 
flexible memory access patterns. In addition to normal random access, the FSRAM 
uses an extra “link” structure, which bypasses the row decoder, for sequential 
accesses. The link structure reduces power consumption and decreases memory 
access times; moreover, it aggressively prefetches data into the on-chip memory. In 
order to eliminate the potential data cache pollution caused by prefetching, a small 
fully associative sequential access buffer (SAB) is used in parallel with the data 
cache. VHDL and HSPICE models of the FSRAM have been developed to evaluate 
its effectiveness at the circuit level. Embedded multimedia applications are simulated 
to demonstrate its performance potential at the architecture level. Our results show 
significant performance improvement with little extra area used by the link structures. 

The remainder of this paper is organized as follows. Section 2 introduces and 
explains the FSRAM and the SAB. In Section 3, the experimental setup is described. 
The architecture level performance analysis and area, timing and power consumption 
evaluations of the FSRAM are presented in Section 4. Section 5 discusses related 
work. Finally, Section 6 summarizes and concludes. 



2 Flexible Sequential Access Memory 

Our flexible sequential and random access memory (FSRAM) architecture is an 
extension of the sequential memory architecture developed by Moon et al. [7]. They
argued that since many DSP applications have static and highly predictable memory 
traces, row address decoders can be eliminated. As a result, memory access would be 
sequential with data accesses determined at compile time. They showed considerable 
power savings at the circuit level. 

While preserving the power reduction property, our work extends their work in 
two ways: (1) in addition to circuit-level simulations, we perform architectural-level 
simulations to assess the performance benefits at the application level; and (2) we 
extend the sequential access mechanism using a novel enhancement that increases 
sequential access flexibility. 

2.1 Structure of the FSRAM 

Fig. 1 shows the basic structure of our proposed FSRAM. There are two address 
decoders to allow simultaneous read and write accesses¹. The read address decoder is
shared by both the memory and the “link” structure. However, the same structure is 
used as the write decoder for the link structure, while the read/write decoder is 
required only for the memory. As can be seen, each memory word is associated with a 
link structure, an OR gate, a multiplexer, and a sequencer. 



¹ Throughout the paper, all experiments are performed assuming dual-port memories. It is
important to note that our FSRAM does not require the memory to have two ports. The reason
we chose two ports is that most modern memory architectures have multiple ports to improve 
memory latency. 
Fig. 1. The FSRAM adds a link, an OR gate, a multiplexer, and a sequencer to each memory
word.

The link structure indicates which successor memory word to access when the 
memory is being used in the sequential access mode. With 2 bits, the link can point to 
four unique successor memory word lines (e.g., N+1, N+2, N+4, and N+8). This link
structure is similar to the “next” pointer in a linked-list data structure. Note that Moon
et al. [7] hardwired the sequencer cell of each row to the row below it. By allowing
more flexibility, and the ability to dynamically modify the link destination, the row 
address decoder can be bypassed for many more memory accesses than previous 
mechanisms to provide greater potential speedup. 
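
As an illustration of the link encoding described above, the following minimal Python sketch decodes a 2-bit link into a successor word line. The function names, the wrap-around at the end of the array, and the example link values are assumptions made for illustration; they are not taken from the FSRAM design itself.

```python
# Hypothetical model of the 2-bit link (names are illustrative only).
LINK_BITS = 2


def successor_row(row, link, num_rows):
    """Return the word line selected next in sequential access mode."""
    if not 0 <= link < (1 << LINK_BITS):
        raise ValueError("link must fit in 2 bits")
    stride = 1 << link  # link 0 -> N+1, 1 -> N+2, 2 -> N+4, 3 -> N+8
    # Wrapping around at the last row is an assumption of this sketch.
    return (row + stride) % num_rows


def follow_links(start, links, steps):
    """Walk the chain of links like a linked list, starting at row `start`."""
    rows = [start]
    for _ in range(steps):
        current = rows[-1]
        rows.append(successor_row(current, links[current], len(links)))
    return rows
```

For example, with every link set to 2, `follow_links(0, [2] * 16, 3)` visits rows 0, 4, 8 and 12 without involving the row address decoder after the first access.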
Fig. 2. (a) Block diagram of the OR block, (b) block diagram of the sequencer.

The OR block shown in Fig. 1 is used to generate the sequential address. If any of 
the four inputs to the OR block is high, the sequential access address (SA_WL) will 
be high (Fig. 2a). Depending on the access mode signal (SeqAcc), the multiplexers
choose between the row address decoder and the sequential cells. The role of the
sequencer is to determine the next sequential address according to the value of the
link (Fig. 2b). If WL is high, then one of the four outputs is high. However, if reset is
high, then all four outputs go low irrespective of WL. The timing diagram of the
signals in Fig. 2 is shown in our previous study [3]. 
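
The behavior of the sequencer, the OR block, and the multiplexer can be sketched at the logic level as follows. This is a behavioral sketch in Python, not the RTL model; the function names are hypothetical, and driving all sequencer outputs low when WL is low is an assumption that the text does not state explicitly.

```python
STRIDES = (1, 2, 4, 8)  # successor offsets selectable by the 2-bit link


def sequencer(wl, link, reset=False):
    """One-hot outputs toward rows N+1, N+2, N+4 and N+8."""
    if reset or not wl:  # all-low when WL is low is an assumption
        return [False] * len(STRIDES)
    return [i == link for i in range(len(STRIDES))]


def or_block(inputs):
    """SA_WL is high if any incoming sequencer output is high."""
    return any(inputs)


def select_word_line(seq_acc, decoder_wl, sa_wl):
    """Multiplexer: SeqAcc picks the sequential path over the row decoder."""
    return sa_wl if seq_acc else decoder_wl


def next_sa_word_lines(word_lines, links, reset=False):
    """Compute SA_WL for every row from the current word lines and links."""
    outs = [sequencer(wl, link, reset) for wl, link in zip(word_lines, links)]
    sa = []
    for row in range(len(word_lines)):
        # Row N collects output i of the sequencer located STRIDES[i] rows above.
        incoming = [outs[row - s][i] for i, s in enumerate(STRIDES) if row - s >= 0]
        sa.append(or_block(incoming))
    return sa
```

In this sketch, an active word line at row 3 with a link value of 2 raises SA_WL only at row 7, and asserting reset clears every sequential word line.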

The area overhead of the FSRAM consists of four parts: the link, OR gate,
multiplexer, and sequencer. The overhead is on the order of 3-7% of the total
memory area for word line sizes of 32 bytes and 64 bytes. More detailed area
overhead results are shown in Table 3 in Section 4.2.
2.2 Update of the Link Structure

The link associated with each off-chip memory word line is dynamically updated 
using data cache miss trace information and run-time reconfiguration of the sequential 
access target. In this manner, the sequentially accessed data blocks are linked when 
compulsory misses occur. Since the read decoder for the memory is the same physical 
structure as the write decoder for the link structure, the link can be updated in parallel 
with a memory access. The default value of the link is 0, which means
the next line (2^0 = 1).

We note that the read and write operations to the memory data elements and the 
link RAM cells can be done independently. The word lines can be used to activate 
both the links and the data RAM cells for read or write (not all of the control signals 
are shown in Fig. 1). 

There are a number of options for writing the link values: 

1. The links can be computed at compile-time and loaded into the data memory
while instructions are being loaded into the instruction memory.

2. The link of one row could be written while the data from another row is
being read.

3. The link can be updated while the data of the same row is being read or
written.

Option 1 is the least flexible approach since it exploits only static information. 
However, it could eliminate some control circuitry that supports the runtime updating 
of the links. Options 2 and 3 update the link structure at run-time, so both
exploit dynamic run-time information. Option 2, however, needs more run-time data
access information compared to Option 3 and thus requires more control logic. We 
decided to examine Option 3 in this paper since the dynamic configuration of the 
links can help in subsequent prefetches. 



2.3 Accessing the FSRAM and the SAB 

In order to eliminate potential cache pollution caused by the prefetching effect of the 
FSRAM, we use a small fully associative cache structure, which we call the 
Sequential Access Buffer (SAB). In our experiments, the on-chip data cache and the 
SAB are accessed in parallel, as shown in Fig. 3. The data access operation is 
summarized in Fig. 4. When a memory reference misses in both the data cache and 
the SAB, the required block is fetched into the data cache from the off-chip memory. 
Furthermore, a data block pointed to by the link of the data word being currently read 
is pushed into the SAB if it is not already in the on-chip memory. That is, the link is 
followed and the result is stored in the SAB. When a memory reference misses in the 
data cache but hits in the SAB, the required block and the victim block in the data 
cache are swapped. Additionally, the data block linked to the required data block, but 
not already in on-chip memory, is pushed into the SAB. 
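
The access flow just described can be sketched as follows. This is a simplified Python model for illustration only: both structures are modeled as fully associative LRU stores, write-back of evicted cache blocks is not modeled, and the class and method names are hypothetical rather than part of the simulator used in this study.

```python
from collections import OrderedDict


class Hierarchy:
    """Toy model of a data cache backed by an SAB and a linked off-chip memory."""

    def __init__(self, cache_blocks=4, sab_entries=8):
        self.cache = OrderedDict()  # LRU and fully associative: a simplification
        self.sab = OrderedDict()    # small fully associative buffer
        self.cache_blocks = cache_blocks
        self.sab_entries = sab_entries
        self.links = {}             # block -> 2-bit link; missing means default 0

    @staticmethod
    def _insert(store, capacity, block):
        """Insert a block, evicting and returning the oldest block if full."""
        victim = None
        if len(store) >= capacity:
            victim, _ = store.popitem(last=False)
        store[block] = True
        return victim

    def _prefetch_linked(self, block):
        """Follow the link of `block` and push the target into the SAB."""
        target = block + (1 << self.links.get(block, 0))
        if target not in self.cache and target not in self.sab:
            self._insert(self.sab, self.sab_entries, target)

    def access(self, block):
        if block in self.cache:
            self.cache.move_to_end(block)
            return "cache hit"
        if block in self.sab:
            # Swap the required block with the victim block of the data cache.
            del self.sab[block]
            victim = self._insert(self.cache, self.cache_blocks, block)
            if victim is not None:
                self._insert(self.sab, self.sab_entries, victim)
            self._prefetch_linked(block)
            return "SAB hit"
        # Miss in both: fetch into the data cache and follow the link.
        self._insert(self.cache, self.cache_blocks, block)
        self._prefetch_linked(block)
        return "miss"
```

With the default link value, a first access to block 0 misses and places block 1 in the SAB, so a later access to block 1 is served by the SAB instead of the off-chip memory.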
Fig. 3. The placement of the Sequential Access Buffer (SAB) in the memory hierarchy.

Fig. 4. Flowchart of a data access when using the SAB and the FSRAM.



3 Experimental Methodology 

To evaluate the system level performance of the FSRAM, we used SimpleScalar 3.0 
[2] to run the Mediabench [1] benchmarks using this new memory structure. The
basic processor configuration is based on the Intel XScale [12]. The Intel XScale
microarchitecture is a RISC core that can be combined with peripherals to provide
application-specific standard products (ASSPs) targeted at selected market segments.
The basic processor configuration is as follows: 32 KB data and instruction
L1 caches with 32-byte data lines, 2-way associativity and 1 cycle latency, no L2
cache, and 50 cycle main memory access latency. The default SAB size is 8 entries. 
The machine can issue two instructions per cycle. It has a 32-entry load/store queue 
and one integer unit, one floating point unit, and one multiplication/division unit, all 
with 1 cycle latency. The branch predictor is bimodal and has 128 entries. The 
instruction and data TLBs are fully associative and have 32 entries. The link structure 
in the off-chip memory was simulated by using a large enough table to hold both the 
miss addresses and their link values. The link values are updated by monitoring the 
L1 data cache miss trace. Whenever the gap between two consecutive misses is 1x, 2x,
3x, 4x block line size, we update the link value associated with the memory line that
causes the first of the two consecutive misses.
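
The table-based link update can be sketched as follows. This minimal Python sketch assumes the power-of-two gap set (1, 2, 4, and 8 lines) implied by the 2-bit link encoding of Section 2.1; the function name, the dictionary used as the link table, and the choice to consider only forward gaps are assumptions made for illustration.

```python
LINE_SIZE = 32  # bytes per cache line, as in the default configuration

# Gap (in lines) -> 2-bit link value; this mapping is an assumption that
# follows the N+1, N+2, N+4, N+8 encoding of Section 2.1.
GAP_TO_LINK = {1: 0, 2: 1, 4: 2, 8: 3}


def update_links(miss_addresses, line_size=LINE_SIZE):
    """Build a link table from a trace of data cache miss addresses."""
    links = {}
    prev_line = None
    for addr in miss_addresses:
        line = addr // line_size
        if prev_line is not None:
            gap = line - prev_line
            if gap in GAP_TO_LINK:
                # Link the line that caused the first of the two misses.
                links[prev_line] = GAP_TO_LINK[gap]
        prev_line = line
    return links
```

Lines that never appear as the first miss of a supported gap keep the default link value of 0 in this sketch, simply by being absent from the table.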



3.1 Benchmark Programs 

We used the Mediabench [13] benchmarks ported to the SimpleScalar simulator for 
the architecture-level simulations of the FSRAM. We used four of the benchmark 
programs, adpcm, epic, g721 and mesa, for these simulations since they were the only 
ones that worked with the SimpleScalar PISA instruction set architecture.

Since the FSRAM link structure links successor memory word lines (Section 2.1),
we show the counts of the address gap distances between two consecutive data cache
misses in Table 1. We see from these results that the address gap distances of 32, 64,
128, 256 and 512 bytes are the most common, while the other address gap distances
occur more randomly. Therefore, the FSRAM evaluated in this study supports address
gap distances of 32, 64, 128 and 256 bytes for a 32-byte cache line, while distances of
64, 128, 256 and 512 bytes are supported for a 64-byte cache line.

For all of the benchmark programs tested, the dominant gap distances are between 
32 and 128 bytes. Most of the tested benchmarks, except g721, have various gap 
distances distributed among 32 to 256 bytes. When the gap increases to 512 bytes, 
epic and mesa still exhibit similar access patterns while adpcm and g721 have no 
repeating patterns at this gap distance. 
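
Counts of this kind can be gathered from a miss trace with a few lines of code. The following Python sketch is illustrative only: the function name is hypothetical, and counting only forward (positive) gaps is an assumption, motivated by the fact that the links point to successor word lines.

```python
from collections import Counter


def gap_histogram(miss_addresses, gaps=(32, 64, 128, 256, 512)):
    """Count selected address gaps between consecutive data cache misses."""
    counts = Counter()
    for prev, cur in zip(miss_addresses, miss_addresses[1:]):
        gap = cur - prev  # forward gaps only (assumption)
        if gap in gaps:
            counts[gap] += 1
    return {g: counts[g] for g in gaps}
```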



Table 1. The frequencies (counts) of the various address distance gaps between two
consecutive data cache misses for the tested benchmark programs

Benchmarks: [...] epic encode, [...] g721 encode, g721 decode, mesa mipmap [...]
[...] 121, 121, 167, [...] 78740
[...] 43, 93, 94, 9, 50896, 22809
128 Bytes: 979, 979, 1864, 80, 0, 0, 5, 497, [...] 3237
[...] 36, 392, [...] 0, 14, 9
[...] 512 Bytes: 0, 0, 5, 896, 0, 0, 3, [...] 16457



Another important issue for the evaluation of benchmark program performance is 
the overall memory footprint estimated from the cache miss rates. Table 2 shows the 
change in the L1 data cache miss rates for the baseline architecture as the size of the
data cache is changed. In general, these benchmarks have small memory footprints, 
especially adpcm and g721. Therefore, we chose data cache sizes in these simulations 
to approximately match the performance that would be observed with larger caches in 
real systems. The default data cache configuration throughout this study is 16 KB 
with a 32-byte line and 2-way set associativity. 



Table 2. The L1 data cache miss rates for the baseline architecture with various L1 cache sizes

Cache   adpcm    adpcm    epic     epic     g721     g721     mesa     mesa     mesa
size    encode   decode   encode   decode   encode   decode   mipmap   osdemo   texgen
2KB     0.0214   0.0174   0.1424   0.1248   0.0010   0.0013   0.0894   0.0207   0.0735
4KB     0.001    0.0011   0.0703   0.0612   0.0003   0.0004   0.0444   0.0173   0.0337
8KB     0.0011   0.0011   0.0362   0.0591   0.0001   0.0001   0.0176   0.0142   0.0127
16KB    0.0010   0.0010   0.0162   0.0569   0.0000   0.0000   0.0086   0.0123   0.0068
32KB    0.0010   0.0010   0.0150   0.0535   0.0000   0.0000   0.0059   0.0112   0.0048



3.2 Processor Configurations 

The following processor configurations are simulated to determine the performance 
impact of adding an FSRAM to the processor and the additional performance 
enhancement that can be attributed to the SAB. 

orig: This is the baseline architecture with no link structure in the off-chip
memory and no prefetching mechanism.

FSRAM: This configuration is described in detail in Section 2.1. To summarize,
this configuration incorporates a link structure in the off-chip memory to exploit
sequential data accesses.

FSRAM_SAB: This configuration uses the FSRAM with an additional small, fully
associative SAB in parallel with the L1 data cache. The details of the SAB were given
in Section 2.3.

tnlp: This configuration adds tagged next line prefetching [14] to the baseline
architecture. With tagged next line prefetching, a prefetch operation is initiated on a 
miss and on the first hit to a previously prefetched block. Tagged next line prefetching 
has been shown to be more effective than prefetching only on a miss [15]. We use this 
configuration to compare against the prefetching ability of the FSRAM. 

tnlp_PB : This configuration enhances the tnlp configuration with a small, fully 
associative Prefetch Buffer (PB) in parallel with the L1 data cache to eliminate the
potential cache pollution caused by next line prefetching. We use this configuration to 
compare against the prefetching ability of the FSRAM_SAB configuration. 
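
For reference, the tagged next line prefetching policy used by the tnlp configurations can be sketched as follows. This Python sketch models only the triggering rule (prefetch on a miss and on the first hit to a previously prefetched block); the unbounded cache, the block-level trace, and the function name are simplifying assumptions and do not reflect the simulated cache organization.

```python
def simulate_tnlp(block_trace):
    """Return (demand misses, prefetches issued) for a trace of block numbers."""
    # block -> tag bit: True while the block was prefetched but not yet referenced.
    cache = {}  # unbounded capacity is a simplification of this sketch
    misses = 0
    prefetches = 0

    def prefetch(block):
        nonlocal prefetches
        if block not in cache:
            cache[block] = True
            prefetches += 1

    for block in block_trace:
        if block not in cache:
            # Demand miss: fetch the block and prefetch the next line.
            misses += 1
            cache[block] = False
            prefetch(block + 1)
        elif cache[block]:
            # First hit to a prefetched block: clear the tag, prefetch again.
            cache[block] = False
            prefetch(block + 1)
    return misses, prefetches
```

On a purely sequential trace such as blocks 0, 1, 2, 3, only the first access misses in this sketch, since every later block has already been prefetched by the time it is referenced.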



4 Performance Evaluation 

In this section we evaluate the performance of an embedded processor with the 
FSRAM and the SAB by analyzing the sensitivity of the processor configuration 
FSRAM_SAB as the on-chip data cache parameters are varied. We also show the 
timing, area, and power consumption results based on RTL and SPICE models of the
FSRAM. 



4.1 Architecture-Level Performance 

We first examine the FSRAM_SAB performance compared to the other processor 
configurations to show the data prefetching effect provided by the FSRAM and the 
cache pollution elimination effect provided by the SAB. Since the FSRAM improves 
the overall performance by improving the performance of the on-chip data cache, we 
evaluate the FSRAM_SAB performance while varying the values for different data 
cache parameters including the cache size, associativity, block size, and the SAB size. 

Throughout Section 4.1, the baseline cache structure configuration is a 16 KB L1
on-chip data cache with a 32-byte data block size, 2-way associativity, and an 8-entry
SAB. The average speedups are calculated using the execution time weighted average
of all of the benchmarks [16].



Performance Improvement due to FSRAM. To show the performance obtained 
from the FSRAM and the SAB, we compare the relative speedup obtained by all four 
processor configurations described in Section 3.2 (i.e., tnlp, tnlp_PB, FSRAM, 
FSRAM_SAB) against the baseline processor configuration (orig). All of the processor
configurations use a 16 KB L1 data cache with a 32-byte data block size and 2-way
set associativity. 

As shown in Fig. 5, the FSRAM configuration produces an average speedup of 
slightly more than 4% over the baseline configuration compared to a speedup of less 
than 1% for tnlp. Adding a small prefetch buffer (PB) to the tnlp configuration 
(tnlp_PB) improves the performance by about 1% compared to the tnlp configuration 
without the prefetch buffer. Adding the same size SAB to the FSRAM configuration 
(FSRAM_SAB) improves the performance compared to the FSRAM without the SAB 
by an additional 8.5%. These speedups are due to the extra small cache structures that 
eliminate the potential cache pollution caused by prefetching directly into the L1
cache. Furthermore, we see that the FSRAM without the SAB outperforms tagged
next-line prefetching both with and without the prefetch buffer. The speedup of the
FSRAM with the SAB compared to the baseline configuration is 8.5% on average and
can be as high as 54% (mesa_mipmap).

Benchmark programs adpcm and g721 have very small performance
improvements, because their memory footprints are so small that there are very few
data cache misses to eliminate in a 16KB data cache (Table 2). Nevertheless, from
the statistics shown in Fig. 5, we can still see that adpcm and g721 follow a similar
performance trend to the one described above.
Fig. 5. Relative speedups obtained by the different processor configurations. The baseline is the
original processor configuration. All of the processor configurations use a 16KB L1 data cache
with 32-byte block and 2-way associativity.

Parameter Sensitivity Analysis. We are interested in the performance of FSRAM
with different on-chip data cache configurations, to examine how the off-chip FSRAM
main memory structure improves on-chip memory performance. In this section we study the
effects of various data cache sizes (i.e., 2KB, 4KB, 8KB, 16KB, 32KB), data cache
associativities (i.e., 1-way, 2-way, 4-way, 8-way), cache block sizes (i.e., 32 bytes, 64
bytes) and SAB sizes (i.e., 4 entries, 8 entries, 16 entries) on the performance. The
baseline processor configuration throughout this section is the original processor
configuration with a 2KB L1 data cache with 32-byte block size and 2-way
associativity.

The Effect of Data Cache Size. Fig. 6 shows the relative speedup distribution among
orig, tnlp_PB and FSRAM_SAB for various L1 data cache sizes (i.e., 2KB, 4KB,
8KB, 16KB, 32KB). The total relative speedup is that of FSRAM_SAB with a given L1
data cache size over orig with a 2KB L1 data cache. It is divided into three parts: the
relative speedup of orig with the given L1 data cache size over orig with a 2KB L1 data
cache; the relative speedup of tnlp_PB over orig at the same L1 data cache size; and the
relative speedup of FSRAM_SAB over tnlp_PB at the same L1 data cache size.

As shown, the relative speedup of tnlp_PB over orig decreases as the L1 data cache
size increases. FSRAM_SAB, in contrast, consistently adds speedup on top of
tnlp_PB across the different L1 data cache sizes. Furthermore, FSRAM_SAB even
outperforms tnlp_PB with a larger L1 data cache in most cases and on
average. For instance, FSRAM with an 8KB L1 data cache outperforms tnlp_PB with a
32KB L1 data cache. However, tnlp_PB outperforms the baseline processor with
a larger data cache only for epic_decode and mesa_osdemo.

The improvement in the performance can be attributed to several factors. While 
the baseline processor does not perform any prefetching, the tagged next line 
prefetching prefetches only the next word line. The fact that our method can prefetch 
with strides is one contributing factor in the smaller memory access time. 
Furthermore, prefetching is realized using sequential access, which is faster than 
random access. Another benefit is that prefetching with different strides does not 
require an extra large table to store the next address to be accessed. 

tnlp_PB and FSRAM_SAB improve performance in the cases where the performance
of orig increases with the L1 data cache size. However, they have little
effect in the cases where the performance of orig does not increase with the L1 data
cache size, which means the benchmark program has a small memory footprint (i.e.,
adpcm, g721). For adpcm, tnlp and FSRAM_SAB still improve performance when the
L1 data cache size is 2KB. For g721, the performance remains almost unchanged
because of the small memory footprint.



[Figure: The Effect of Various Cache Sizes (2KB, 4KB, 8KB, 16KB, 32KB); legend: orig, tnlp_PB, FSRAM_SAB]

Fig. 6. Relative speedup distribution among the different processor configurations (i.e., orig,
tnlp_PB, FSRAM_SAB) with various L1 data cache sizes (i.e., 2KB, 4KB, 8KB, 16KB, 32KB).
The baseline is the original processor configuration with a 2KB L1 data cache with 32-byte
block size and 2-way associativity.



The Effect of Data Cache Associativity. Fig. 7 shows the relative speedup distribution
among orig, tnlp_PB and FSRAM_SAB for various L1 data cache associativities (i.e.,
1-way, 2-way, 4-way, 8-way).

Increasing the L1 data cache associativity typically reduces the number
of L1 data cache misses. The reduction in misses reduces the effect of prefetching
from tnlp_PB and FSRAM_SAB. As can be seen, the speedup of
tnlp_PB on top of orig decreases as the L1 data cache associativity increases. The
speedup almost disappears when the associativity is increased to 8-way for
mesa_mipmap and mesa_texgen. However, FSRAM_SAB still provides significant
speedups.

tnlp_PB and FSRAM_SAB still have little impact on the performance of adpcm
and g721 because of their small memory footprints.
[Figure: The Effect of Various Cache Associativities (1-way, 2-way, 4-way, 8-way); legend: orig, tnlp_PB, FSRAM_SAB]

Fig. 7. Relative speedup distribution among the different processor configurations (i.e., orig,
tnlp_PB, FSRAM_SAB) with various L1 data cache associativities (i.e., 1-way, 2-way, 4-way,
8-way). The baseline is the original processor configuration with a 2KB L1 data cache with
32-byte block size and 2-way associativity.




Fig. 8. Relative speedup distribution among the different processor configurations (i.e., orig,
tnlp_PB, FSRAM_SAB) with various L1 data cache block sizes (i.e., 32B, 64B). The baseline is
the original processor configuration with a 2KB L1 data cache with 32-byte block size and
2-way associativity.



The Effect of Data Cache Block Size. Fig. 8 shows the relative speedup distribution
among orig, tnlp_PB and FSRAM_SAB for various L1 data cache block sizes (i.e.,
32B, 64B).

Increasing the L1 data cache block size typically reduces the number of
L1 data cache misses. For all of the benchmarks, the reduction in misses reduces the
effect of prefetching from tnlp_PB and FSRAM_SAB. As can be seen, the
speedup of tnlp_PB on top of orig decreases as the L1 data cache block
size increases from 32 bytes to 64 bytes. However, increasing the L1 data
cache block size can also cause cache pollution, as it does for epic_encode and
mesa_mipmap. Tnlp with a small prefetch buffer reduces the pollution, and
FSRAM_SAB further improves the performance.



The Effect of SAB Size. Fig. 9 shows the relative speedup distribution among orig, 
tnlp_PB and FSRAM_SAB for various SAB sizes (i.e., 4 entries, 8 entries, 16 
entries). 
Fig. 9 compares the FSRAM_SAB approach to tagged next-line prefetching that
uses a prefetch buffer of the same size as the SAB. As shown, FSRAM_SAB
always adds speedup on top of tnlp_PB. Further, FSRAM_SAB outperforms tnlp with a
larger prefetch buffer. This result indicates that FSRAM_SAB is actually a more
efficient prefetching mechanism than a traditional tagged next-line prefetching
mechanism.

tnlp_PB and FSRAM_SAB still have little impact on the performance of adpcm
and g721 because of their small memory footprints.



[Figure: The Effect of Various SAB Sizes (4 entries, 8 entries, 16 entries); legend: tnlp_PB, FSRAM_SAB]

Fig. 9. Relative speedup distribution among the different processor configurations (i.e.,
tnlp_PB, FSRAM_SAB) with various SAB sizes (i.e., 4 entries, 8 entries, 16 entries). The
baseline is the original processor configuration with a 2KB L1 data cache with 32-byte block
size and 2-way associativity.



4.2 Timing, Area, and Power Consumption 

We implemented the FSRAM architecture in VHDL to verify its functional 
correctness at the RTL level. We successfully tested various read/write combinations 
of row data vs. links. Depending on application requirements, one or two decoders 
can be provided so that the FSRAM structure can be used as a dual-port or single-port
memory structure. In all our experiments, we assumed dual-port memories since 
modern memory structures have multiple ports to decrease memory latency. 

In addition to the RTL level design, we implemented a small 8x8 (8 rows, 8 bits 
per row) FSRAM in HSPICE using 0.18 µm technology to test timing correctness and
evaluate the delay of sequencer blocks. Note that unlike the decoder, the sequencer 
block’s delay is independent of the size of the memory structure: it only depends on 
how many rows it links to (in our case: 4). 

By adding sequencer cells, we will be adding to the area of the memory structure. 
However, in this section we show that the area overhead is not large, especially 
considering the fact that in today’s RAMs, a large number of memory bits are 
arranged in a row. An estimate of the percentage increase in area was calculated using
the formula $\left(\frac{A_1}{A_1 - A_2} - 1\right) \times 100\%$, where $A_1$ = total area and $A_2$ = area occupied
by the link, OR gate, MUX and the sequencer. Table 3 shows the results of the
increases in area for different memory row sizes. The sequencer has two SRAM bits,
which is not many compared to the number of bits packed in a row of the memory. 
We can see that the sequencer cell logic does not occupy a significant area either. 



Table 3. Area overhead of FSRAM with various memory word line sizes

No. of bits per row of memory    Increase in area due to the MUX and the sequencer
8 (1 byte)                       216%
16 (2 bytes)                     119%
64 (8 bytes)                     23.0%
256 (32 bytes)                   7.12%
512 (64 bytes)                   3.10%



As can be seen, the percentage increase in area drops substantially as the number 
of bits in each word line increases. Hence the area overhead is almost negligible for 
large memory blocks. 
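
As a quick check of how the area formula behaves, the following minimal sketch evaluates it for made-up areas. The function name and all numeric values are illustrative assumptions, not measurements from the HSPICE layout; the sketch only shows that a fixed overhead becomes relatively smaller as the row grows.

```python
def area_increase_percent(total_area: float, overhead_area: float) -> float:
    """Percentage increase (A1 / (A1 - A2) - 1) * 100, with A1 = total, A2 = overhead."""
    base_area = total_area - overhead_area  # area of the row without link/sequencer logic
    if base_area <= 0:
        raise ValueError("overhead_area must be smaller than total_area")
    return (total_area / base_area - 1.0) * 100.0


# Hypothetical areas (arbitrary units): the same fixed overhead of 7 units is
# attached to rows of growing size, so the relative overhead shrinks.
for row_area in (25.0, 100.0, 200.0):
    increase = area_increase_percent(row_area + 7.0, 7.0)
    print(f"row area {row_area:6.1f} -> increase {increase:5.2f}%")
```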

Using the HSPICE model, we compared the delay of the sequencer cell to the 
delay of a decoder. Furthermore, by scaling the capacity of the bit lines, we estimated 
the read/write delay and hence, calculated an overall speedup of 15% of sequential 
access compared to random access. 

Furthermore, the power saving is 16% in sequential access at VDD = 3.3 V in the 
0.18 micron CMOS HSPICE model. 



5 Related Work 

The research related to this work can be classified into three categories: on-chip 
memory optimizations, off-chip memory optimizations, and hardware-supported 
prefetching techniques. 

In their papers, Panda et al. [4, 5] address data cache size and number of 
processor cycles as performance metrics for on-chip memory optimization. Shiue et 
al. [6] extend this work to include energy consumption and show that it is not enough 
to consider only memory size increase and miss rate reduction for performance 
optimization of on-chip memory because the power consumption actually increases. 
In order to reduce power consumption, Moon et al. [7] designed an on-chip sequential 
access only memory specifically for DSP applications that demonstrates the low- 
power potential of sequential access. 

A few papers have addressed the issue of off-chip memory optimization, 
especially power optimization, in embedded systems. In a multi-bank memory system, 
Dela Luz et al. [11] show promising power consumption reduction by using an 
automatic data migration strategy to co-locate the arrays with temporal affinity in a 
small set of memory banks. But their approach has major overhead due to extra time 
spent in data migration and extra power spent to copy data from bank to bank. 

Zucker et al. [10] compared hardware prefetching techniques adopted from 
general-purpose applications to multimedia applications. They studied a stride 
prediction table associated with PC (program counter). A data-cache miss-address- 
based stride prefetching mechanism for multimedia applications is proposed by 
Dela Luz et al. [11]. Both studies show promising results at the cost of extra on-chip 
memory devoted to a table structure of non-negligible size. Although low-cost hybrid 
data prefetching slightly outperforms hardware prefetching, it limits the code that 
could benefit from prefetching [9]. Sbeyti et al. [8] propose an adaptive prefetching 
mechanism which exploits both the miss stride and miss interval information of the 
memory access behavior of only MPEG4 in embedded systems. 

Unlike previous approaches, we propose a novel off-chip memory with little area 
overhead (3-7% for 32-byte and 64-byte data block lines) and significant 
performance improvements, whereas previous works propose expensive on-chip 
memory structures. Our study investigated an off-chip memory structure that improves 
on-chip memory performance, thus leaving designers the flexibility to allocate 
expensive on-chip silicon area. Furthermore, we reduced the power consumption of 
off-chip memory. 



6 Conclusions 

In this study, we proposed the FSRAM mechanism that makes it possible to eliminate 
the use of address decoders during sequential accesses and also random accesses to a 
certain extent. 

We find that FSRAM can efficiently prefetch the linked data block into the on-chip 
data cache and improve performance by 4.42% on average for an embedded system 
using a 16KB data cache. In order to eliminate the potential cache pollution caused by 
the prefetching, we used a small fully associative cache called the SAB. The experiments 
show that, with the SAB, FSRAM raises the average performance improvement of the 
tested benchmark programs to 8.85%. Compared to tagged next-line prefetching, 
FSRAM_SAB consistently performs better and can still speed up execution when 
tnlp_PB cannot. This indicates that FSRAM_SAB is a more efficient prefetching 
mechanism. 

FSRAM supports both sequential accesses and random accesses. At the expense of 
negligible area overhead (3-7% for 32-byte and 64-byte data block lines) from the 
link structure, we obtained a speedup of 15% for sequential access over random access 
from our RTL and SPICE models of FSRAM. Our design also shows that 
sequential access saves 16% of the power consumption. 

The link structure/configuration explored in this paper is not the only way; a 
multitude of other configurations can be used. Depending upon the requirement of an 
embedded application, a customized scheme can be adopted whose level of flexibility 
during accesses best suits the application. For this, prior knowledge of access patterns 
within the application is needed. In the future, it would be useful to explore power- 
speed trade-offs that may bring about a net optimization in the architecture. 



Abstract. One of many technical challenges facing the designers of reconfigurable 
systems is how to integrate hardware and software resources. The problem 
of allocating each application function to general purpose processors (GPPs) 
and Field Programmable Gate Arrays (FPGAs) while considering the system resource 
restrictions and application requirements is becoming harder. We propose a solution 
to this problem that employs the Y-chart design space exploration approach, and we 
develop Y-Sim, a simulation tool employing the solution. Its procedure is as follows: 
First, generate the mapping set by matching each function in a given application 
with GPPs and FPGAs in the target reconfigurable system. Second, 
estimate the throughput of each mapping case in the mapping set by simulation. 
With the simulation results, the most efficient configuration, the one achieving the 
highest throughput among the mapping cases, is chosen. We also propose 
HARMS (Heuristic Algorithm for Reducing Mapping Sets), a heuristic 
algorithm that reduces the mapping set by eliminating unnecessary mapping 
cases according to their workload and parallelism, in order to reduce the simulation 
time overhead. We show experimental results of the proposed solution using Y-Sim 
and the efficiency of HARMS. The experimental results indicate that HARMS can 
reduce the mapping set by up to 87.5% and most likely retains the mapping case 
with the highest throughput. 



1 Introduction 

When one considers implementing a certain computational task, the highest 
performance can be achieved by constructing a specialized machine, i.e., hardware. 
Indeed, this way of implementation exists in the form of Application-Specific 
Integrated Circuits (ASICs). However, as many reconfigurable devices such as Field 
Programmable Gate Arrays (FPGAs) are developed and improved, many computational 
tasks can be implemented, or configured, on those devices at almost any point 
by the end user. Their computation performance has not exceeded ASIC performance 
but can surpass that of general purpose processors by a factor of several hundred, 
depending on the application. In addition to the performance gain, reconfigurable 
devices have the advantage of flexibility, whereas the structure of an ASIC cannot 
be modified after fabrication. Moreover, many independent computational units or 
circuits can be implemented on a single FPGA within its cell and connection limits, 
and many FPGAs can be organized as a single system. 

This novel technology brings about a primary distinction between programmable 
processors and configurable ones. The programmable paradigm involves a 
general-purpose processor, able to execute a limited set of operations, known as the 
instruction set. The programmer's task is to provide a description of the algorithm to 
be carried out, using only operations from this instruction set. This algorithm need not 
necessarily be written in the target machine language since compilation tools may be used; 
however, ultimately one must be in possession of an assembly language program, 
which can be directly executed on the processor in question. The prime advantage of 
programmability is the relatively short turnaround time, as well as the low cost per 
application, resulting from the fact that one can reprogram the processor to carry out 
any other programmable task [1,3]. 

Reconfigurable computing, which promises both performance and flexibility, 
has emerged as a significant area of research and development for both the academic 
and industrial communities. Still, there are many technical challenges facing the 
designers of reconfigurable systems. These systems become truly complex when we 
consider how to integrate hardware and software resources [12]. Unfortunately, current 
design environments and computer aided design tools have not yet successfully integrated 
the technologies needed to design and analyze a complete reconfigurable system [11]. 

We propose a solution to this problem that employs the Y-chart design space 
exploration approach and assumes a pre-configuration policy, together with Y-Sim, a 
simulation tool for evaluating the performance of the system. Its procedure is as 
follows: First, it generates the mapping set by matching each function in a given 
application with GPPs and FPGAs, and it estimates the throughput of each mapping case 
in the mapping set by simulation. With the simulation results, the most efficient 
configuration, the one achieving the highest throughput among the mapping cases, is 
chosen. During the simulation, Y-Sim resolves the resource conflicts that arise when 
many subtasks try to occupy the same FPGA. 

We also propose HARMS (Heuristic Algorithm for Reducing Mapping Sets), a 
heuristic algorithm that reduces the huge size of a mapping set. Although each 
mapping case should be simulated to find an appropriate configuration, eliminating 
unnecessary mapping cases according to their workload and parallelism helps to 
reduce the simulation time overhead significantly. 

Section 2 summarizes previous related research. Section 3 describes Y-Sim, 
which employs the Y-chart design space exploration approach. Section 4 explains 
HARMS, and Section 5 shows its efficiency through its reduction effect on simulation. 
Section 6 draws conclusions from the proposed solution and the reduction algorithm. 



2 Related Research 

Lee et al. proposed a reconfigurable system substructure targeting multi-application 
embedded systems, such as PDA (Personal Digital Assistant) and IMT2000 (International 
Mobile Telecommunication), which execute various programs. They also proposed 
a dynamic FPGA configuration model which interchanges several functional 
units on the same FPGA [5]. 

Kienhuis proposed Y-chart design space exploration [4]. This approach quantifies 
important performance metrics of the system elements on which given applications will 
be executed. Kienhuis developed ORAS (Object Oriented Retargetable Architecture 
Simulator), a retargetable simulator based on the Y-chart approach [6]. ORAS calculates 
performance indicators of possible system configurations through simulation. However, 
simulation of a reconfigurable system needs a simulation substructure different 
from ORAS, since ORAS is limited to a specific calculation model of Stream-based 
Function. 

Kalavade et al. investigate the optimization problem in the design of multi-function 
embedded systems that run a pre-specified set of applications. They mapped given 
applications onto an architecture template which contains one or more programmable 
processors, hardware accelerators and coprocessors. Their proposed model of 
hardware-software partitioning for multi-function systems identifies sharable application 
functions and maps them onto the same hardware resources. In that study, however, the 
target applications are limited to a specific domain and FPGAs are not considered [8]. 

Pruna et al. partition huge applications into subtasks to employ small reconfigurable 
hardware and schedule the data flows. Their study, however, is limited to a configuration 
of one subtask per FPGA, and multi-function FPGAs are not considered [9]. 



3 Partitioning Tool 

3.1 Design of Hardware Software Partitioning Tool 

Y-Sim, a partitioning tool employing the Y-chart approach, is designed to perform 
hardware-software partitioning for reconfigurable systems. This tool breaks down a given 
application into subtasks, maps the functions of each subtask, and determines the 
configuration of the elements in the reconfigurable system. Performance indicators of 
each configuration are calculated to choose the best configuration. The same architecture 
model and application model as in the Y-chart approach are employed; they are shown in 
Table 1. 

The architecture model contains architecture elements (AEs), which represent processing 
units such as general purpose processors and FPGAs. Each architecture element 
has its own AE_ID and, in the case of an FPGA, its maximum number of usable 
cells (Usable_Resource) and its functional elements (FEs). Functional elements have 
resource request information, which indicates the number of cells needed to configure 
the function. Functional elements also have execution time information, the time 
needed to process each data unit. Functions configured on an FPGA do not interfere 
with one another, but functions executed on a general purpose processor do; exclusion 
is therefore considered for functional elements executed by general purpose processors. 
Fig. 1 shows an example of a hardware model with three architecture elements, each of 
which has two functional elements. 

Table 1. Architecture Model and Application Model 

Architecture Model 

Architecture := List_of_AE 
List_of_AE := AE | List_of_AE AE 
AE := AE_ID Usable_Resources List_of_FE 
List_of_FE := FE | List_of_FE FE 
FE := FE_ID Resource_Request Execution_Time Exclusion 

Application Model 

Application := List_of_PE 
List_of_PE := PE | List_of_PE PE 
PE := PE_ID Next_PE_list FEQ_list 
Next_PE_list := PE_ID | Next_PE_list PE_ID 
FEQ_list := FEQ | FEQ_list FEQ 



Architecture Model 

Architecture Element: AE_ID=1, Usable Resource=9000 
    Functional Element: FE_ID=1, Resource Request=3000, Execution Time=100ms, Exclusion 
    Functional Element: FE_ID=2, Resource Request=5000, Execution Time=120ms, Exclusion 

Architecture Element: AE_ID=2, Usable Resource=6000 
    Functional Element: FE_ID=2, Resource Request=5000, Execution Time=120ms 
    Functional Element: FE_ID=3, Resource Request=1000, Execution Time=50ms 

Architecture Element: AE_ID=3, Usable Resource=6000 
    Functional Element: FE_ID=4, Resource Request=3000, Execution Time=80ms 
    Functional Element: FE_ID=5, Resource Request=3000, Execution Time=60ms 


Fig. 1. An Example of Hardware Model 

Fig. 2. An Example of Application Model 

An application model is represented by processing elements (PEs), which are the 
subtasks of the application, and by the direction of data flow indicated by Next_PE_list. 
The flow of data and the sequence of processing elements must contain no feedback and 
no loops. Each processing element has its own identifier, PE_ID, and its functions are 
specified by FEQ_list. Fig. 2 shows an example of an application model with four 
processing elements. 
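
The two models of Table 1 map naturally onto small record types. The sketch below is one possible encoding, not the data structures used inside Y-Sim: the class and field names are assumptions derived from the grammar, the architecture values are those listed for Fig. 1, and the four-stage application is hypothetical because the contents of Fig. 2 are not reproduced in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FunctionalElement:
    fe_id: int
    resource_request: int    # cells needed to configure the function
    execution_time_ms: int   # time to process one data unit
    exclusion: bool = False  # set when executions must not overlap


@dataclass
class ArchitectureElement:
    ae_id: int
    usable_resource: int
    fes: List[FunctionalElement] = field(default_factory=list)


@dataclass
class ProcessingElement:
    pe_id: int
    next_pe_list: List[int]  # successors in the data flow (no loops allowed)
    feq_list: List[int]      # requested functions, assumed to be FE_ID values


# Architecture model with the values listed for Fig. 1.
architecture = [
    ArchitectureElement(1, 9000, [FunctionalElement(1, 3000, 100, True),
                                  FunctionalElement(2, 5000, 120, True)]),
    ArchitectureElement(2, 6000, [FunctionalElement(2, 5000, 120),
                                  FunctionalElement(3, 1000, 50)]),
    ArchitectureElement(3, 6000, [FunctionalElement(4, 3000, 80),
                                  FunctionalElement(5, 3000, 60)]),
]

# Hypothetical four-stage application; not the actual content of Fig. 2.
application = [
    ProcessingElement(1, [2], [1]),
    ProcessingElement(2, [3], [2]),
    ProcessingElement(3, [4], [1]),
    ProcessingElement(4, [], [3]),
]


def candidate_aes(function_id: int) -> List[int]:
    """AE_IDs of the architecture elements that offer the requested function."""
    return [ae.ae_id for ae in architecture
            if any(fe.fe_id == function_id for fe in ae.fes)]


print({pe.pe_id: [candidate_aes(f) for f in pe.feq_list] for pe in application})
```

A lookup such as `candidate_aes(2)` lists the architecture elements on which a subtask requesting function 2 could be mapped, which is the information a mapping generator needs.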

In the Y-chart approach, mapping is done by generating mapping cases from the 
architecture model and the application model. A mapping set consists of mapping cases, 
each representing a configuration that directs architecture elements to run the 
identified functions of the subtasks. 

Y-Sim simulates each mapping case in the mapping set. Y-Sim handles the resource 
conflicts that occur when many subtasks of a given application are simultaneously 
mapped onto the same hardware element. The basic process of simulation is to let data 
units flow from the beginning of the given application to its end. The timing 
information for each mapping, which is the basis of the performance analysis, is 
gathered as each architecture element processes each data unit. Total elapsed time, one 
of the performance indicators, is calculated when all the given data units have been 
processed. Other performance indicators include throughput, the time taken to pass a 
data unit through the application, the delay resulting from resource request conflicts, 
parallelism, and the usage of each architecture element. A system designer can choose 
the best mapping, or the best system configuration, from the performance indicators. 



3.2 Structure of Y-Sim 

Y-Sim consists of an application simulator, an architecture simulator and an 
architecture-application mapping controller. The application simulator constructs an 
application model from a given application and simulates the flow of data units. The 
architecture simulator simulates each architecture element, whose functional elements 
execute the functions requested by the data units they receive. The 
architecture-application mapping controller generates and maintains the mapping 
information between subtasks and architecture elements, and routes data units from 
application subtasks to their mapped architecture elements. Fig. 3 shows the structure 
of Y-Sim constructed with the application shown before. 

Application Simulator: The application simulator simulates and controls the flow of 
data units between the processing elements, which represent the subtasks of a given 
application. Each processing element adds its function request to the data units it 
receives. The application simulator sends them to the architecture-application 
mapping controller. Data units routed to and processed in the architecture simulator 
are returned to the processing element, and the processing element sends the data units 
to the next processing element. The application simulator generates data units, gives 
them to the first processing element of the generated application model, and gathers 
the processed data units to calculate the time taken to process a data unit with the 
given application model and the delay. 

Architecture Simulator: The architecture simulator simulates the servicing of 
functional element requests from the application simulator and calculates the execution 
time taken to process each data unit. When receiving a data unit, a processing element 
in the application simulator invokes the architecture-application mapping controller to 
ask that a functional element in an architecture element process the data unit; other 
requests must wait until that request is served. In addition, when the exclusion flag 
of a functional element is set, the architecture element executes that functional 
element exclusively. Because of the exclusion, hardware resource request conflicts 
occur in configurations where many subtasks are mapped to one architecture element or 
where many subtasks are mapped to one functional element in an architecture element. 
In Fig. 3, for example, a resource conflict would occur in some mappings because 
subtasks 1 and 3 issue their functional element requests to the same functional element. 
Y-Sim takes the conflict into account and resolves it. After serving the request, the 
functional element records the execution time in the data unit and returns it to the 
application simulator, which returns it to the processing element the data unit 
originated from. Each architecture element records its time for each mapping case. 
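
The waiting behaviour caused by exclusion can be illustrated with a very small model of one exclusive functional element. The text does not state which policy Y-Sim uses to order conflicting requests, so the first-come-first-served rule below is an assumption, as are the function name and the arrival times.

```python
def serve_exclusive(arrival_times_ms, execution_time_ms):
    """Serve requests to one exclusive functional element in arrival order.

    Returns a list of (arrival, start, finish) tuples; start - arrival is the
    delay caused by the resource conflict.
    """
    free_at = 0
    schedule = []
    for arrival in sorted(arrival_times_ms):
        start = max(arrival, free_at)   # wait while the element is busy
        free_at = start + execution_time_ms
        schedule.append((arrival, start, free_at))
    return schedule


# Hypothetical case: two subtasks request the same 100 ms function at t = 0,
# and a third request arrives at t = 50.
schedule = serve_exclusive([0, 0, 50], 100)
delays = [start - arrival for arrival, start, _ in schedule]
print(schedule, delays)
```

Summing such delays over all data units gives one of the performance indicators mentioned above, the delay resulting from resource request conflicts.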




Fig. 3. The Structure of Y-Sim 

Architecture-Application Mapping Controller: The architecture-application 
mapping controller generates all the possible mappings using the information on the 
functional elements of each architecture element and the subtask information of the 
given application. The size of the mapping set is $\prod_i j_i$, where $i$ ranges over 
the functions requested by the subtasks of the given application and $j_i$ is the 
number of functional elements which can execute function $i$. The size of the mapping 
set in Fig. 3 is 2, since function 2 requested by subtask 2 can be executed in 
architecture element 1 and architecture element 2. Y-Sim iterates over the mapping 
cases, and during the simulation the architecture-application mapping controller routes 
data units sent by processing elements in the application simulator to the corresponding 
architecture elements in the architecture simulator. Performance indicators of each 
iteration are gathered and used to choose the best mapping. 
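
The generation of the mapping set amounts to a Cartesian product over the candidate elements of each requested function. The following sketch shows this under stated assumptions: the request list and candidate table are hypothetical inputs chosen so that only function 2 has two candidates, as in the Fig. 3 example, and `mapping_set` is an illustrative name rather than a Y-Sim routine.

```python
from itertools import product
from math import prod

# Hypothetical inputs: (subtask, function) requests and, per function, the
# architecture elements that can execute it.
requests = [(1, 1), (2, 2), (3, 1), (4, 3)]
candidates = {1: [1], 2: [1, 2], 3: [2]}


def mapping_set(requests, candidates):
    """Enumerate every mapping case as a dict {(subtask, function): ae_id}."""
    options = [candidates[function] for _, function in requests]
    return [dict(zip(requests, choice)) for choice in product(*options)]


cases = mapping_set(requests, candidates)
# The size equals the product of the number of candidates per request.
expected_size = prod(len(candidates[function]) for _, function in requests)
print(len(cases), expected_size)
```

Because the size is a product of per-request candidate counts, adding either requests or candidate elements multiplies the number of cases to simulate.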



4 Simulation Time Reduction Heuristic 

The architecture-application mapping controller generates $\prod_i j_i$ mapping cases. 
Therefore, as the number of subtasks of a given application and the number of function 
requests of each subtask increase, and as more architecture elements and functional 
elements in them are accommodated, the size of the mapping set increases exponentially. 
The performance of the simulator is directly influenced by the size of the mapping set 
and can be improved if the mapping set is reduced before simulation. The simulation 
time reduction algorithm, HARMS, decreases the size of the mapping set as follows: the 
algorithm first excludes impossible mapping cases by comparing the available resources 
of each FPGA (the number of FPGA cells) with the number of FPGA cells requested by the 
functional elements. Then it reduces the mapping set by analyzing the workload (the 
time taken to process a data unit), the parallelism, and their relationship, all of 
which can be identified when each mapping case is generated. 



4.1 Mapping Set Reduction Based on Resource Restriction 

Various functions can be configured and executed simultaneously on the same FPGA. 
However, because FPGAs have a restricted number of programmable cells, these functions 
should be organized or synthesized to fit within the size of the target FPGA. For 
example, the XC5202 from Xilinx has 256 logic cells, and circuits of up to about 3,000 
gates can be configured on it. The XC5210 has 1296 logic cells, and logic circuits of 
10,000 to 16,000 gates can be configured on it [10]. 

Because of this resource restriction, some mapping cases generated by the 
architecture-application mapping controller cannot be configured. The reduction is done 
by comparing the number of cells in each FPGA of the target system with the number of 
FPGA cells required to configure each mapping case. 
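
The resource check can be sketched as a simple filter over mapping cases. The data layout and the function name are assumptions for illustration, and the sketch further assumes that a function requested by several subtasks on the same FPGA is configured only once; the text does not say how Y-Sim counts shared functions.

```python
def fits_resources(mapping, cells_needed, usable_cells):
    """Return True if every FPGA has enough cells for the functions mapped to it.

    mapping:      {(subtask, function): ae_id}
    cells_needed: {function: cells required to configure it}
    usable_cells: {ae_id: usable cells}, listing FPGAs only
    """
    configured = {}
    for (_, function), ae_id in mapping.items():
        if ae_id in usable_cells:  # general purpose processors are not cell-limited
            configured.setdefault(ae_id, set()).add(function)
    return all(
        sum(cells_needed[f] for f in functions) <= usable_cells[ae_id]
        for ae_id, functions in configured.items()
    )


# Hypothetical example: AE 2 is an FPGA with 6000 usable cells, AE 1 is a GPP.
cells_needed = {2: 5000, 3: 1000, 4: 3000}
usable_cells = {2: 6000}
feasible = {(1, 2): 2, (2, 3): 2, (3, 4): 1}
infeasible = {(1, 2): 2, (2, 3): 2, (3, 4): 2}
mapping_cases = [feasible, infeasible]
reduced = [m for m in mapping_cases if fits_resources(m, cells_needed, usable_cells)]
print(len(mapping_cases), "->", len(reduced))
```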
4.2 Mapping Set Reduction with Analysis of Workload Parallelism 

Reconfigurable systems improve system performance by configuring co-processors 
or hardware accelerators on reconfigurable devices such as FPGAs. When an FPGA and 
a general purpose processor execute the same function, the FPGA has better performance 
despite its shortcoming of a low clock speed [1,3]. In addition, reconfigurable 
systems have higher parallelism, because functions executed on an FPGA can also be 
executed on a general purpose processor, and functions configured simultaneously on 
the same FPGA run without interfering with each other. In a reconfigurable system using 
FPGAs, the parallelism P1 and P2, workload W1 and W2, and throughput T1 and T2 of 
m1 and m2, two mapping cases in a mapping set M, have the following relationship: 

IF W1>W2 and P1<P2 then T1<T2 (1) 

The throughput of a configuration whose parallelism is increased and whose workload 
is decreased by FPGAs is higher than that of a configuration without them. The Heuristic 
Algorithm for Reducing Mapping Sets (HARMS) is based on this relationship. As 
shown later, HARMS excludes mapping cases which have more workload and less 
parallelism than any other mapping case. HARMS is a heuristic algorithm because 
some specific mapping cases have workload and parallelism which do not satisfy 
relationship (1), and HARMS excludes such mapping cases. 
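
Relationship (1) suggests a pruning rule that can be written in a few lines. The sketch below is only one reading of the description: it assumes that a mapping case is dropped when some other case has both strictly less workload and strictly more parallelism, and it assumes that workload and parallelism values are already available for every case. The `Case` type, the function name and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class Case(NamedTuple):
    name: str
    workload: float     # e.g. time taken to process a data unit
    parallelism: float


def harms_prune(cases: List[Case]) -> List[Case]:
    """Keep only the cases that no other case beats on both criteria."""
    kept = []
    for m in cases:
        dominated = any(
            other.workload < m.workload and other.parallelism > m.parallelism
            for other in cases
        )
        if not dominated:
            kept.append(m)
    return kept


# Hypothetical mapping cases.
cases = [
    Case("a", workload=300, parallelism=1),
    Case("b", workload=200, parallelism=2),
    Case("c", workload=250, parallelism=3),
    Case("d", workload=200, parallelism=3),
]
print([c.name for c in harms_prune(cases)])
```

Under this reading only case "a" is removed, because case "b" has less workload and more parallelism. Since the rule never inspects the structure of a mapping case, it can discard a case whose throughput is in fact high, which is why the method is a heuristic.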



5 Experimental Results 

Section 5 shows the effectiveness and efficiency of the proposed algorithm, HARMS. 
Using simulation results, the efficiency of HARMS is presented as the percentage 
of inefficient mapping cases that HARMS can eliminate. A specific case in which 
HARMS excludes the best mapping case is also presented and the reason is explained. 



5.1 Simulation Environments 

We assume that the architecture model is based on a reconfigurable system which has 
a general purpose processor and several FPGAs, and that the application model is based 
on the H.263 decoder algorithm shown in Fig. 4 and the picture in picture algorithm 
shown in Fig. 5, which is used to display two images simultaneously on a television 
screen. The data flow of the H.263 decoder is modeled to encompass a feedback loop so 
that the application can be modeled as a streaming data processing model. H.263 is a 
widely used video signal processing algorithm and its data flow is easy to identify [13]. 
Some subtasks in the picture in picture algorithm request the same functions, and 
resource conflicts usually arise when it is implemented on a reconfigurable system. 
Therefore, these two algorithms are appropriate for evaluating the performance of Y-Sim 
and the efficiency of HARMS. For each algorithm, the reconfigurable system of the 
architecture model was simulated and the mapping cases were generated by Y-Sim. 

S.-Y. Ahn, J.-Y. Kim, and J.-A Lee 




Fig. 4. H.263 Decoder Algorithm 

Fig. 5. Picture in Picture Algorithm (figure labels: Horizontal Lines, Vertical Lines) 



5.2 Efficiency and Accuracy of HARMS 

To show the efficiency and accuracy of HARMS, the simulation results illustrate 
that the excluded mapping cases are mostly not the best mapping cases. The accuracy of 
HARMS is shown as follows: First, we assume that the generated mapping cases of the 
H.263 decoder and picture in picture run on a reconfigurable system with one general 
purpose processor and three FPGAs. Then, the accuracy of HARMS is demonstrated by 
showing that the reduced mapping set contains the best mapping case. Since HARMS does 
not analyze the structure of mapping cases, the best mapping case is excluded under some 
specific system configurations. For these cases, a solution resolving this problem is 
also presented. 



5.2.1 Comparison of Before Running HARMS and After 

Fig. 6 and Fig. 7 compare the mapping sets before and after applying 
HARMS. The initial size of the mapping set of the picture in picture model is 16384, and 
the size is reduced to 1648 by the resource restriction. HARMS reduces the size of the 
mapping set to 240, and the reduced mapping set contains the best mapping case. As 
the simulation results show, HARMS reduces the mapping set without excluding the 
best mapping case. 




Fig. 6. Comparison of mapping set of H.263 before and after running HARMS 
(legend: • performance of generated mapping cases; ○ performance of mapping cases HARMS chooses) 

Fig. 7. Comparison of mapping set of Picture in Picture before and after running HARMS 
(legend: • performance of generated mapping cases; ○ performance of mapping cases HARMS chooses) 



Table 2. The Efficiency of HARMS 





            Initial size of    Resource limitation    HARMS is applied 
            mapping set        is considered          size      efficiency(%) 
H.263(1)    4096               4096                   928       77.3 
H.263(2)    729                624                    186       70.2 
H.263(3)    64                 40                     7         82.5 
H.263(4)    64                 64                     8         87.5 
H.263(5)    4096               4084                   2175      46.7 
Pip(1)      64                 13                     12        7.7 
Pip(2)      16384              1648                   240       85.4 



5.3 Efficiency of HARMS 

This section shows the reduction ratio of the mapping set when HARMS is applied. 
Table 2 shows the simulation and reduction results for five system configurations of 
the H.263 decoder and two cases of the picture in picture algorithm. The system 
configurations are as follows: 

H.263 (1) - two general purpose processors (GPPs) and two FPGAs. 

H.263 (2) - one GPP and one FPGA. 

H.263 (3) - one GPP and one FPGA. 

H.263 (4) - similar to H.263 (3) but with a smaller FPGA. 

H.263 (5) - one GPP and three FPGAs. 

Pip (1) - one GPP and one FPGA. 

Pip (2) - one GPP and three FPGAs. 

The efficiency is the fraction of the mapping set obtained after the resource 
limitation is considered (Bef_size) that HARMS removes, where Aft_size is the size of 
the mapping set after HARMS is applied. It is given by the following equation and 
reported as a percentage in Table 2: 



Efficiency of HARMS = (Bef_size - Aft_size)/Bef_size (2) 
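
Equation (2) can be checked directly against the sizes listed in Table 2. The short sketch below does this; the function name is an illustrative choice, while the size pairs are the table's "resource limitation is considered" and "HARMS is applied" columns.

```python
def harms_efficiency(bef_size: int, aft_size: int) -> float:
    """Equation (2): fraction of the resource-feasible mapping set removed by HARMS."""
    return (bef_size - aft_size) / bef_size


# (Bef_size, Aft_size) pairs taken from Table 2.
table2 = {
    "H.263(1)": (4096, 928),
    "H.263(2)": (624, 186),
    "H.263(3)": (40, 7),
    "H.263(4)": (64, 8),
    "H.263(5)": (4084, 2175),
    "Pip(1)": (13, 12),
    "Pip(2)": (1648, 240),
}

for name, (bef, aft) in table2.items():
    print(f"{name}: {100 * harms_efficiency(bef, aft):.1f}%")
```

Rounded to one decimal place, the printed percentages agree with the efficiency column of Table 2.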

As shown in Table 2, HARMS reduces the size of the mapping set by about 80 percent. 
In the case of H.263(5), architecturally identical configurations appear as different 
mapping cases in the generated mapping set, since the system has three identical FPGAs 
on which the same functional element can be configured. As a result of this redundant 
configuration, the efficiency of HARMS is comparatively low, 46.7%. In the case of 
Pip(1), the mapping set is already reduced so much by the resource restriction that 
the efficiency is very low, 7.7%. 



6 Conclusion 

This paper proposes Y-Sim, a hardware-software partitioning tool. Given an application 
model and an architecture model representing the configuration of a reconfigurable 
system, Y-Sim explores the possible mappings between the architecture model and the 
application model and generates the mapping set, which represents the possible 
configurations of the given reconfigurable system. Then, Y-Sim simulates each mapping 
case in the mapping set and finds the best one. Y-Sim takes resource restrictions of 
FPGAs, such as the number of usable cells, into account and resolves the resource 
conflicts arising when many subtasks of a given application share the same resource. 

As more FPGAs are installed in a reconfigurable system and more functions are 
configured on FPGAs, subtasks can be mapped onto many FPGAs or general purpose 
processors. Consequently, the size of the mapping set increases exponentially, and the 
time taken to analyze the performance of such a system is heavily influenced by that size. 
Generally, the throughput of a reconfigurable system using FPGAs has a positive 
relationship to parallelism and a negative relationship to workload. Based on this 
relationship, the proposed heuristic algorithm, HARMS, efficiently reduces the mapping 
set. Experimental results show that, for the various architecture and application 
models of reconfigurable systems running the H.263 decoder and the picture in picture 
algorithm, HARMS reduces the size of the mapping set by about 80% and up to 87.5%. 



Abstract. In this paper, the architecture of a high-performance 32-bit 
fixed-point DSP called DSP3000 is proposed and implemented. The 
DSP3000 employs a super-Harvard architecture and can issue three memory 
access operations in a single clock cycle. The processor has eight 
pipe stages with separate memory read and write stages, which alleviate 
data dependency problems and improve the execution efficiency. 
The processor also possesses a modulo addressing unit with an optimized 
structure to enhance the address generation speed. A fully pipelined 
MAC (Multiply Accumulate) unit is incorporated in the design, which 
enables a 32 x 32 + 72 MAC operation in a single clock cycle. The processor 
is implemented in the SMIC 0.18 µm 1.8V 1P6M process and has a 
core size of 2.2mm by 2.4mm. Test results show that it can operate at a 
maximum frequency of 300MHz with an average power consumption of 
30mW/100MHz. 



1 Introduction 

Digital signal processors find applications in a wide variety of areas, such as 
wireless communication, speech recognition and image processing, where high 
speed calculation capability is of primary concern. With the ever increasing 
applications in battery-powered electronics, such as cellular phones and digital 
cameras, the power dissipation of DSPs is also becoming a critical issue. The 
trend of achieving high performance while preventing power consumption 
from surging poses a great challenge to modern DSP design. 

Various approaches have been proposed to address this challenge by exploring 
different levels and aspects of the entire DSP design flow [1] [2]. This paper, however, 
focuses on the architecture-level design of a DSP that meets 
both performance and power consumption requirements. It presents the overall 
architecture of DSP3000, a high-performance 32-bit fixed-point digital signal 
processor. It also proposes novel micro-architectures aimed at increasing 
performance and reducing chip area. Techniques to reduce the power consumption 
of the chip are described as well. 

The rest of the paper is organized as follows. Section 2 gives a detailed description 
of the overall architecture of the processor as well as the micro-architectures 
of the pipeline organization, address generation unit and MAC unit. Section 3 describes 
the performance of the implemented processor. Section 4 presents the 
conclusion of the research. 

2 Architecture of DSP3000 

In order to enhance performance, the design of the processor incorporates 
deep pipelining, two-way superscalar issue and a super-Harvard structure. 
Specifically, the core of the DSP has eight pipe stages and two data-paths 
working in parallel. It is also capable of accessing data in two different 
data memory spaces simultaneously. Fig. 1 illustrates the major components of 
the processor, which is composed of the DSP core, on-chip memories, an instruction 
cache and peripherals. 




Fig. 1. The block diagram of DSP3000 



The instructions are fetched from the instruction cache by the instruction fetch 
unit and passed to the instruction decode unit, where they are decoded 
into microcodes. This unit also has a hardware stack with a depth of sixteen to 
support hardware loops effectively. The decoded instructions are issued to the AGU 
(Address Generation Unit) and the DALU (Data Arithmetic Logic Unit) for the 
corresponding operations. The AGU is responsible for calculating the addresses 
in both the U and V memories that the core is going to access. Various addressing 
modes, including register direct addressing and modulo addressing, 
are supported by this module. Among them, modulo addressing is particularly 
important to the performance of the DSP and will be examined in detail 
in section 2.2. The DALU works in parallel with the AGU and performs 
arithmetic and logic operations such as MAC (Multiply Accumulate), accumulation 
and shift. It also offers some powerful bit operations, such as bit set and 
bit clear, to enhance the control capability of the chip so that the chip can also 
be used as an MCU (Micro Control Unit). The MAC operation, critical 
to the performance of DSP algorithms, is implemented with a novel structure, 
which lets the chip finish a MAC operation in one clock cycle and reduces the 
chip area and the power consumption at the same time. This structure will be 
highlighted in section 2.3. 

2.1 Pipeline Organization and Write Back Strategy 

The data-path of the DSP3000 core is divided into eight pipe stages, named 
P1-P8, as shown in Fig. 2. In order to coordinate the 
operations of the AGU and the DALU, execution actually occupies four pipe 
stages, i.e., P4-P7. Basically, P4 and P5 in the DALU simply pipe the instructions 
down and perform no extra operations except for some decode activities that prepare 
for the operations beginning on P6. The pipeline organization features separate 
memory read and write stages in the AGU, which are introduced to deal with the 
data dependence problems aggravated by the wide span of the execution pipe stages. 
This can be explained with the following instructions: 

MAC R0, R1, R5    U:(AR0+), R0    V:(AR4+), R1 
SUB R2, R4        U:(AR0+), R2    R5, V:(AR4+) 

The first instruction performs a MAC operation on the data in registers R0, 
R1 and R5 and stores the result into register R5. Meanwhile, two parallel move 
operations load R0 and R1 with the data read from U and V memory respectively. 
The second instruction subtracts R4 from R2 and stores the result into 
R4. There are also two parallel move operations in this instruction. The second 
of them stores the value of register R5, which is the result of the first 
instruction, to V memory. If memory read and write were combined in a single 
stage, this operation would have to start at the beginning of the P6 stage, since 
arithmetic operations begin on this stage. Otherwise the DALU might not get the desired data 
from memory for the corresponding operations. Yet the desired value of R5 is not 
available until the first instruction finishes its operation on P7. This would require 
a stall or a NOP between the two instructions to avoid the data hazard, 
reducing instruction execution efficiency. With the separate memory read 
and write strategy, however, the memory write operation happens on the P7 stage. 
The above data hazard can then be avoided by forwarding the accumulate result 
to the input of the memory write function unit. 

The new memory access strategy, however, would give rise to memory access 
contention problems if no other measures were taken. This problem can be 
illustrated by the following instructions: 

MOVE R3,U:(AR6) 

MOVE U:(AR7),R4 

The first instruction writes the U memory indexed by AR6 with the value in R3, 
while the second instruction reads the data from U memory to R4. These two 
Fig. 2. The pipeline structure of DSP3000 core 



instructions will cause the DSP core to access the U memory in different pipe 
stages simultaneously. In order to avoid this contention, the pipeline incorporates 
a buffer called the write back queue in the write back stage P8, as illustrated 
in Fig. 3. When the control unit detects a memory access contention, it holds 
the memory write operation and stores the corresponding data in the queue 
while the memory read operation continues. That is to say, the memory read 
instruction has a higher priority than the preceding memory write instruction. 
The memory write operation can be resumed as soon as the address bus is free. In 
order to prevent a RAW (Read after Write) hazard during the execution of a memory 
read operation, the AGU first searches the write back queue for the desired data. 
If there is a hit, the data is fetched directly from the write back queue. 
Otherwise, the read goes to memory. 
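
The behaviour of the write back queue can be sketched in software. The following Python model is only an illustration under simplifying assumptions: one bus access per cycle, a dictionary standing in for U memory, and the hypothetical names `WriteBackQueue` and `cycle`, none of which come from the actual design.

```python
from typing import Dict, Optional, Tuple

Write = Tuple[int, int]  # (address, data); a hypothetical request format


class WriteBackQueue:
    """Behavioural sketch: a one-entry queue parks a write while a read owns the bus."""

    def __init__(self) -> None:
        self.memory: Dict[int, int] = {}       # stands in for U memory
        self.pending: Optional[Write] = None   # the single queue entry

    def cycle(self, write: Optional[Write] = None,
              read_addr: Optional[int] = None) -> Optional[int]:
        """Model one clock cycle; return the read data if a read was requested."""
        if read_addr is None:
            # Bus is free: retire the parked write first. A new write either
            # takes the freed queue entry or goes straight to memory.
            if self.pending is not None:
                addr, data = self.pending
                self.memory[addr] = data
                self.pending = write
            elif write is not None:
                self.memory[write[0]] = write[1]
            return None
        # The read has priority, so a simultaneous write is parked in the queue.
        if write is not None:
            if self.pending is not None:
                raise RuntimeError("a queue deeper than one entry would be needed")
            self.pending = write
        # RAW check: search the queue before going to memory.
        if self.pending is not None and self.pending[0] == read_addr:
            return self.pending[1]
        return self.memory.get(read_addr, 0)
```

In this sketch a read that hits the parked entry is served from the queue, and the parked write reaches memory in the first cycle without a read request.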




Fig. 3. The mechanism of write back queue 



The depth of the write back queue is one, which can be shown by the following 
analysis. The contention problem can only happen in a read after write 
situation. If the following instruction is a read operation, the write operation 
remains suspended in the write back queue and no new write instruction is pushed 
into the queue. On the other hand, if the following instruction is a write operation, 
the memory bus is free for one clock cycle, during which the 
suspended write operation can be finished. Consequently, the write back queue 
needs to be only one level deep, which adds little hardware overhead. Besides 
increasing instruction execution efficiency, the proposed pipeline 
organization also saves the power that would otherwise be 
consumed by NOP instructions. 

2.2 Modulo Addressing Unit in AGU 

Modulo addressing is widely used in generating waveforms and creating circular 
buffers for delay lines [6]. The processor supports two kinds of modulo addressing, 
i.e., increment modulo addressing and decrement modulo addressing. Conventionally, 
they are realized by creating a circular buffer defined by the lower 
boundary (base address) and the upper boundary (base address + modulus - 1). 
When the calculated address surpasses the upper boundary, it wraps around 
through the lower boundary. On the other hand, if the address goes below the 
lower boundary, it wraps around through the upper boundary [6]. The modulo 
addressing unit of the processor adopts this basic idea, yet it employs a modified 
approach to improve efficiency. 




Fig. 4. The modified modulo addressing scheme 

As shown in Fig. 4, a bit mask generated according to the value of the modulus 
divides the pre-calculated address data into two sections, the MSBs (Most 
Significant Bits) and the LSBs (Least Significant Bits). The MSBs, together with 
the succeeding K-bit zeros, form the base address of the circular buffer, where 
$2^{K} > \mathrm{modulus}$ and $2^{K-1} \le \mathrm{modulus}$. The K-bit LSBs are zero-extended to 
32 bits before they are put into the Modulo Unit, where the modulo operation 
is performed. The base address and the partial modulo address generated in 
the Modulo Unit are then ORed together to get the final address. With this scheme, 
the lower boundary of the circular buffer becomes zero from the Modulo Unit’s 
perspective. Consequently, the following algorithm can be applied in the Modulo 
Unit. 

For increment modulo addressing: 

Intermediate = LSBs + OffsetValue; 
If (Intermediate >= Modulus) 
    PartialModuloAddress = Intermediate - Modulus; 
Else 
    PartialModuloAddress = Intermediate; 

For decrement modulo addressing: 

Intermediate = LSBs - OffsetValue; 
If (Intermediate < 0) 
    PartialModuloAddress = Intermediate + Modulus; 
Else 
    PartialModuloAddress = Intermediate; 

It is required that OffsetValue be no larger than Modulus; otherwise the result 
is unpredictable [6]. The corresponding hardware implementation of increment 
modulo addressing is shown in Fig. 5(a). The longest path of this structure 
includes an adder, a subtracter and a multiplexer. Further optimization 
of the unit can be carried out by dividing the timing-critical path into two 
parallel data-paths that calculate the two possible addresses simultaneously, as 
illustrated below: 








Fig. 5. The diagram of the modulo unit. An, En and Mn are the registers holding 
the LSBs, the offset value and the modulus respectively. (a) The diagram of the direct 
implementation of the unit. (b) The diagram of the proposed implementation of the 
unit, where CSA stands for Carry Save Adder. 
Datapath 1: LSBs + OffsetValue 

Datapath 2: LSBs + OffsetValue - Modulus 

The operation of Datapath 2 can also be written as: 

Datapath 2: LSBs + OffsetValue + ~Modulus + 1 

The first three operands in Datapath 2 can be compressed into two values by a 
one-level CSA (Carry Save Adder) [3]. These two values can then be fed 
into a 32-bit adder. The carry-in pin of this cascaded adder, which is usually 
tied to 0, can be used to accept the remaining operand 1. In addition, 
the carry-out signal of the adder reflects the relative magnitude of Intermediate 
and Modulus. That is, the signal is 1 if Intermediate is greater than or equal to 
Modulus; otherwise it is 0. Consequently, the carry-out signal can be used as the select 
signal of the multiplexer so that the compare unit can be removed. The proposed 
modulo addressing unit structure is shown in Fig. 5(b). The critical path of this 
structure includes an inverter, a one-level CSA, an adder and a multiplexer. 
Since a one-level CSA is actually an array of full adders, the delay of the CSA equals 
that of a full adder. Furthermore, the delay of the inverter is negligible when 
calculating the entire delay of the timing path. Consequently, the delay of the 
critical path of this structure is significantly reduced compared with that of the 
former one. The implementation of decrement modulo addressing follows 
the same idea as that of increment modulo addressing. It can be achieved 
by adding control signals and multiplexers to the proposed structure so that 
both kinds of modulo addressing are accomplished by sharing most of the 
components. 
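
The two-datapath scheme for increment modulo addressing can be checked with a small bit-level model. The Python sketch below is an illustration only, not the actual circuit. It assumes 32-bit unsigned operands, an address that already lies inside the circular buffer, and wrap-around as soon as Intermediate reaches Modulus, consistent with an upper boundary of base address + modulus - 1. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def carry_save_add(a: int, b: int, c: int):
    """One-level 3:2 compression: returns (s, carry) with s + carry == a + b + c."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def increment_modulo(address: int, offset: int, modulus: int) -> int:
    """Sketch of increment modulo addressing with a carry-out selected result."""
    k = modulus.bit_length()            # smallest K with 2**K > modulus
    low_mask = (1 << k) - 1
    base = address & ~low_mask & MASK32  # MSBs followed by K zero bits
    lsbs = address & low_mask

    # Datapath 1: LSBs + OffsetValue
    path1 = (lsbs + offset) & MASK32

    # Datapath 2: LSBs + OffsetValue + ~Modulus + 1, using the CSA and carry-in
    s, c = carry_save_add(lsbs, offset, ~modulus & MASK32)
    total = s + c + 1                    # the constant 1 enters through carry-in
    path2 = total & MASK32
    carry_out = (total >> 32) & 1        # 1 when LSBs + OffsetValue >= Modulus

    partial = path2 if carry_out else path1
    return base | partial
```

For example, with a buffer of modulus 5 based at 0x1000, stepping address 0x1003 by 3 wraps to 0x1001 in this model.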

2.3 Fully Pipelined MAC Unit 

In order to increase the operating speed of the DSP, the MAC unit is divided into 
two pipe stages and takes a latency of two clock cycles to complete an operation. 
Instead of the conventional approach of simply putting the multiplier and the 
accumulator in different pipe stages [6], however, a different approach is employed in the 
division of the unit, as shown in Fig. 6. 

In the first pipe stage, a radix-4 modified Booth encoder encodes the 32-bit 
multiplier and multiplicand into 17 partial products [4] [5]. These partial 
products are compressed into two 64-bit values by a Wallace tree constructed with 4-2 
compressors [4]. Traditionally these two results are put into an adder to produce 
the multiplication result. In this processor, however, the two results, together 
with the third operand, which is sign-extended to 72 bits, are fed into a one-level 
CSA, where they are compressed into another two intermediate results. In the 
second pipe stage, a multiplexer selects among the signals piped down from the previous 
stage according to the SEL signal. If SEL indicates a MAC or a multiplication 
operation, the multiplexer selects the signals from the CSA. Otherwise, two 72-bit 
source operands are selected. A 72-bit adder then adds the selected signals to 
get the final result. 

Since the adder inherent in a multiplier is substituted with a CSA, which has 
a delay of only one full adder, the proposed MAC structure achieves a more bal- 
anced pipeline than the traditional one. With the proposed structure, one MAC 
Fig. 6. The diagram of the proposed MAC unit, where PPRT stands for Partial 
Product Reduction Tree. 



operation only needs one 72-bit adder. With the traditional style, however, two 
adders are involved in the MAC operation: one 64-bit adder for multiplication 
and one 72-bit adder for accumulation. Consequently, this structure can significantly 
reduce the area cost and the power consumption of the unit by 
removing the 64-bit addition. Although the latency of a single multiply 
operation is increased, the fact that MAC is used more frequently than a single 
multiply operation in DSP algorithms justifies this trade-off. 
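
The idea of merging the accumulate operand into the multiplier's redundant result can be illustrated numerically. The Python sketch below is a behavioural model under stated assumptions, not the actual datapath. The Booth encoder and Wallace tree are replaced by an arbitrary split of the product into two vectors, all values are handled in 72-bit two's complement, and the names `mac` and `csa` are hypothetical.

```python
BITS = 72
MASK72 = (1 << BITS) - 1
ALT_MASK = int("55" * 9, 16)  # alternating bit pattern, 72 bits wide


def to_signed(value: int, bits: int = BITS) -> int:
    """Interpret an unsigned bit pattern as a two's complement number."""
    return value - (1 << bits) if value >> (bits - 1) else value


def csa(a: int, b: int, c: int):
    """One-level carry save adder: s + carry == a + b + c (mod 2**72)."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s & MASK72, carry & MASK72


def mac(multiplier: int, multiplicand: int, accumulator: int) -> int:
    """Return multiplier * multiplicand + accumulator using a single final adder."""
    product = (multiplier * multiplicand) & MASK72
    # Stand-in for the partial product reduction tree: any two vectors whose
    # sum equals the product (an assumption made purely for illustration).
    sum_vec = product & ALT_MASK
    carry_vec = product & ~ALT_MASK & MASK72
    # Stage 1: fold the sign-extended accumulator in with one CSA level.
    s, c = csa(sum_vec, carry_vec, accumulator & MASK72)
    # Stage 2: the only carry-propagate addition, 72 bits wide.
    return to_signed((s + c) & MASK72)
```

In this model `mac(3, -4, 10)` evaluates to -2, obtained with one carry-propagate addition instead of a 64-bit addition followed by a 72-bit accumulation.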



2.4 Cache Strategy 

The DSP3000 has a 2K x 32-bit instruction cache to enhance performance 
and reduce power consumption by preventing the core from accessing 
external memory frequently. The cache is two-way set associative and adopts the 
write-through scheme. It also employs the LRU (Least Recently Used) algorithm 
to replace the data. 

When a cache miss occurs, the DSP core must enter a wait state until the desired 
instructions are fetched from external program memory, which may take tens of 
clock cycles. In order to save power during that period, a clock gating 
approach is employed to hold the core state. Fig. 7 shows the clock gating circuit 
between the PLL and the DSP core. The core-hold signal is captured by a latch, 
which is used to prevent glitches [7], and ANDed with the clock signal from the PLL 
to generate the core clock. In the real implementation, the latched 
core-hold signal is ORed with scanen before it is ANDed with the clock to meet the 
DFT (Design for Test) requirements. The scanen signal is asserted only when the 
chip is in scan mode. When the cache control module detects a cache miss, core-hold 
is pulled down before the end of the clock cycle and the core clock is stopped. In 
that case, no operation happens within the DSP core until the core-hold 
Fig. 7. Clock gating structure designed to stop the clock of the core when a cache miss 
occurs 



signal is released. With the proposed structure, the dynamic power consumption 
on the clock tree during cache miss is eliminated and thus the average power 
consumption on chip is further reduced. 
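
The gating logic described above can be exercised with a small sample-by-sample model. The Python sketch below is a hedged illustration. It assumes the latch is transparent while the PLL clock is low, which is a common arrangement for AND-type gating but is not stated in the text, and the function and parameter names are hypothetical.

```python
def gated_clock(clk_samples, core_hold_samples, scan_en=0):
    """Return the core clock for a sequence of sampled clk and core-hold values.

    core_hold is treated as active low: 0 stops the core clock, 1 lets it run.
    """
    latched = 1          # latch output; assumed to start with the clock enabled
    core_clk = []
    for clk, hold in zip(clk_samples, core_hold_samples):
        if clk == 0:
            latched = hold             # latch is transparent only while clk is low
        enable = latched | scan_en     # scan mode forces the clock on (DFT)
        core_clk.append(clk & enable)  # AND gate producing the core clock
    return core_clk
```

Because the latch holds its value while the clock is high, a core-hold transition during the high phase cannot truncate the current pulse in this model; the clock simply stays low from the next cycle on.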



3 Performance Analysis 



The processor is modeled in Verilog HDL and synthesized with Design Compiler 
using the SMIC (Semiconductor Manufacturing International Corporation) 
0.18 μm general standard cell libraries. The physical implementation of the processor, 
which includes floorplanning, placement, clock tree synthesis and routing, 
is carried out with Synopsys back-end tools. Fig. 8 shows the final layout 
of DSP3000. The processor is fabricated in the SMIC 0.18 μm 1.8 V 1P6M process 
and is tested on an ADVANTEST T6672. The test results show that the processor can 
operate at a maximum speed of 300 MHz with an average power consumption of 
30 mW/100 MHz. Table 1 summarizes the main characteristics of the processor. 



Table 1. DSP3000 Characteristics 

Item                        Characteristics 
Process                     SMIC 0.18 μm 1P6M 
Core Voltage                1.8 V 
IO Voltage                  3.3 V 
Core Size                   2.2 mm x 2.4 mm 
Die Size                    4.8 mm x 4.8 mm 
Operating frequency         300 MHz 
Average Power Consumption   30 mW/100 MHz 
Fig. 8. The layout of DSP3000 



Table 2. Comparison between DSP3000 and Other Commercialized DSPs 

Item                       DSP3000       TI C54X*      TI C55X*      Blackfin* 
Category                   32-bit        16-bit        16-bit        dual MAC 16-bit 
                           fixed-point   fixed-point   fixed-point   fixed-point 
Operating frequency        300 MHz       160 MHz       300 MHz       350 MHz 
Benchmark: 256-point FFT   2257 cycles   8542 cycles   4786 cycles   3176 cycles 
Benchmark: Real FIR        N/2 + 6       N/2 + 16      N/2 + 3       N/2 + 2 
Benchmark: Complex FIR     4N + 12       8N + 13       2N + 4        2N + 2 
Benchmark: Delayed LMS     2N + 10       2N + 14       2N + 5        1.5N + 4.5 

* Source: http://www.ti.com ; http://www.analog.com 



Table 2 compares DSP3000 with some other commercialized 
DSPs. Several benchmarks are listed in the table, including a 256-point 
complex radix-2 FFT with bit reversal, a real-coefficient FIR filter, a complex FIR 
filter and a delayed LMS (Least Mean Square) filter. All the benchmarks are given in 
clock cycles per output sample unless otherwise noted. The data for the commercialized 
DSPs are based on the benchmarks provided by the cited websites. Due 
to the optimized MAC unit and enhanced AGU capability, DSP3000 exhibits 
competitively low cycle counts and short execution times in running such 
programs. 
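
As a worked example of how cycle counts and clock frequency combine into execution time, the FFT row of Table 2 can be converted to microseconds. The Python sketch below uses only the figures listed in the table; the variable and function names are illustrative.

```python
# Cycle counts for the 256-point FFT and operating frequencies from Table 2.
FFT_CYCLES = {"DSP3000": 2257, "TI C54X": 8542, "TI C55X": 4786, "Blackfin": 3176}
CLOCK_MHZ = {"DSP3000": 300, "TI C54X": 160, "TI C55X": 300, "Blackfin": 350}


def execution_time_us(cycles: int, clock_mhz: float) -> float:
    """Cycles divided by MHz gives time in microseconds."""
    return cycles / clock_mhz


fft_time_us = {
    name: execution_time_us(FFT_CYCLES[name], CLOCK_MHZ[name]) for name in FFT_CYCLES
}

for name, t in fft_time_us.items():
    print(f"{name}: {t:.2f} us")
```

With these figures the DSP3000 FFT takes roughly 7.5 μs, compared with about 9.1 μs for the Blackfin entry, 16.0 μs for the C55X entry and 53.4 μs for the C54X entry.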

4 Conclusion 

This paper presents the overall architecture as well as the novel micro-architectures 
of DSP3000, a 32-bit fixed-point digital signal processor. 
With the improvements in pipeline organization, address generation and 
MAC operation, the processor executes DSP programs with high efficiency 
while achieving a low average power consumption. Test results show that it 
can reach an operating frequency of 300 MHz with an average power consumption of 
30 mW/100 MHz. This processor can be applied in fields where high performance 
and low power consumption are both required, such as mobile phones and digital 
cameras. 
Abstract. TengYue-1 is a microprocessor subsystem for embedded 
applications. Its heart is a 32-bit RISC microprocessor based on an instruction 
set architecture (ISA) designed by us. Through a WISHBONE-compatible 
on-chip bus, the microprocessor, a universal memory controller, an LCD 
controller and other peripheral I/Os form the SoC. TengYue-1 has been 
implemented and verified in SMIC 0.18 μm CMOS technology, and the 
maximum clock frequency is 300 MHz at 1.8 V. This paper presents the design 
and implementation of TengYue-1. We used 9 ARM benchmarks to evaluate 
the performance of the microprocessor, and the results showed that it met our 
goal. We also found a simple solution to the memory access conflict problem 
caused by the microprocessor core and the LCD controller. 



1 Introduction 

Currently, embedded systems are favored by major semiconductor and mobile device 
manufacturers. They need simple, light and low-power micro-controllers rather than 
high-performance general-purpose microprocessors. The current design goal is therefore 
lower power and higher performance within given constraints. In addition, both on-chip 
and off-chip configuration of peripheral devices must be possible to meet various market 
demands. 

TengYue-1 is a design for embedded systems based on the above requirements. 
TengYue-1 is an ‘island’ containing a 32-bit RISC microprocessor core (named CH) 
and other macro modules, such as a universal memory controller, an LCD controller 
and several peripheral I/O devices. Hardware debugging support and a production 
test interface are also included. TengYue-1 was implemented in SMIC 0.18 μm CMOS 
technology. The maximum clock frequency is 300 MHz at 1.8 V. The design is 
oriented toward authentication and data encryption/decryption in information security 
applications. A coprocessor interface and abundant peripherals make system 
integration easy. 

The remainder of this paper is organized as follows. Section 2 shows the architecture 
of TengYue-1, and section 3 describes the design of each component of TengYue-1 in 
detail. The implementation scheme is explained in section 4. Section 5 gives the results of 
performance evaluation. Finally, section 6 gives the conclusions. 

1 TengYue: in Chinese, means "jump over". 



2 Architecture of TengYue-1 

As shown in Fig. 1, TengYue-1 consists of a 32-bit RISC microprocessor core, a 
universal memory controller, an LCD controller, two UART serial ports, an I2C 
interface, a 32-bit GPIO interface, a programmable interrupt controller, a power 
management unit and a test interface. 

The microprocessor core is the heart of TengYue-1. The instruction set architecture 
is a typical RISC architecture. The instruction set is summarized in Table 1. In 
designing the instruction set, we emphasized pipelining efficiency and 
efficiency as a compiler target. In contrast to the ARM architecture [2], the CH 
architecture is simple and efficient to implement. After a detailed analysis of the execution of 
ARM instructions [3], we took the ARM instruction set as a basis, added shift instructions 
and reduced the addressing modes of data manipulation instructions. Thus the length of 
the execution path is reduced. We increased the number of GPRs to help the compiler 
exploit more parallelism. 
Fig. 1. Architecture of TengYue-1 



CH has 32 32-bit general-purpose registers (GPRs), named R0, R1, ..., R31. These 
GPRs can be mapped either as two separate sets of 32-bit GPRs, each containing 16 
registers, for fast context switching in case of exception handling, or as a 
single set of 32 32-bit GPRs. CH uses a Harvard architecture, with separate instruction 
and data caches. 
Table 1. Instruction set summary 

Instruction Type                     Number of this type 
Branch                               4 (16 conditions) 
Data Manipulation                    27 
Load/Store                           12 (6 addressing modes each) 
Status Register Transfer             2 
Exception generating Instructions    1 
Coprocessor instructions             5 (2 for memory access, 6 addressing modes each) 
JVM extension instructions           6 
Total number                         57 



As shown in Fig. 1, the grey area is the RISC microprocessor core. To achieve high 
performance, we also have to pay attention to the clock frequency, execution 
efficiency (measured in terms of IPC or CPI), die size and power consumption in the 
design. 



2.1 Microprocessor Core: CH 

The 32-bit RISC microprocessor CH consists of two parts: the instruction pipeline 
and the memory subsystem. We will discuss these two parts in the following 
subsections. 

2.1.1 Instruction Pipeline 

In general, RISC architectures use pipelining as much as possible in order to 
parallelize tasks and to use existing hardware resources efficiently. This usually 
results in higher clock frequencies and therefore higher performance. 
However, the use of a long instruction pipeline has some drawbacks. The longer the 
pipeline, the more cycles are required to refill it when executing a 
taken branch. CH uses a classical five-stage pipeline, consisting of the Instruction Fetch 
stage (IF), Instruction Decode stage (ID), Execution stage (EXE), Memory Access 
stage (MEM) and Write Back stage (WB), as shown in Fig. 2. As CH has a single-issue 
pipeline, the IF stage fetches one instruction from the instruction cache every 
cycle. The ID stage performs the instruction decoding and prepares the register 
operands. The instruction decoding can be done quickly because of the fixed 
instruction set encoding. Hazard (structural hazard and data hazard) detection and 
resolution are also performed in the ID stage. CH detects data hazards by scoreboarding 
[1] and resolves them with bypassing and forwarding, or by simply stalling the 
successive instructions until the instruction that causes the hazard is completed. The 
EXE stage contains an ALU, a barrel shifter and a multiplier, which execute arithmetic 
and logical instructions, shift instructions and multiplication instructions respectively. 
The EXE stage also generates the address for load/store instructions. Memory is 
accessed in the MEM stage. In the WB stage, the results of the EXE stage or the data loaded 
from memory are written back to the register file. 
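
The interplay of forwarding and stalling can be illustrated with a cycle-count estimate for a short instruction sequence. The Python sketch below rests on textbook assumptions that the text does not confirm for CH: ALU results are forwarded to the next instruction without a stall, while a loaded value consumed by the immediately following instruction costs one stall cycle. The `Instr` record and both function names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                   # "alu" or "load" (illustrative classes only)
    dest: Optional[int]       # destination register number, if any
    srcs: Tuple[int, ...]     # source register numbers


def count_load_use_stalls(program: List[Instr]) -> int:
    """Count adjacent pairs where a load result is used by the next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "load" and prev.dest is not None and prev.dest in cur.srcs:
            stalls += 1       # forwarding alone cannot cover this dependency
    return stalls


def total_cycles(program: List[Instr], stages: int = 5) -> int:
    """Cycles to run the program on an ideal single-issue pipeline plus stalls."""
    if not program:
        return 0
    return stages + len(program) - 1 + count_load_use_stalls(program)
```

Under these assumptions, a load followed by two dependent ALU instructions takes 8 cycles: 7 for the ideal five-stage flow of three instructions plus one load-use stall.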

Precise exceptions can be implemented easily since the hazard detection and 
resolution algorithms are simple. When an interrupt or exception is detected, the core only 
needs to flush the pipeline before handling it. 
Fig. 2. Instruction Pipeline of CH 

2.1.2 Memory Subsystem 

The memory subsystem contains the instruction cache, data cache, write buffer and 
the memory management unit (MMU). The instruction cache and data cache are both 
8KB, 4-way set-associative, and both use the LRU replacement algorithm. The write 
buffer has 32 16-byte entries. The instruction cache and data cache are both pipelined 
so that it is possible to deliver an instruction every cycle. All units in the memory 
subsystem can be enabled or disabled by the user through the memory control 
register. The instruction TLB (Translation Look-aside Buffer) and data TLB are both 
fully associative. 

The instruction cache uses the virtual address as index and tag. A 7-bit identifier labels 
different processes. Thus the instruction cache does not need to be flushed on a 
context switch [1]. To avoid aliasing, the data cache uses the physical address as index and 
tag. Using the physical address has two advantages. Firstly, it resolves the consistency 
problem of the data cache; secondly, it is convenient for sharing data among processes (or 
threads). On a context switch, the data cache does not need to be flushed, so new 
processes can use the shared data without any cost [1]. 

Both the instruction cache and data cache use the LRU replacement strategy. We 
implemented a simple systolic array that keeps track of LRU information for a set 
of cache lines. It can handle one cache access every cycle at low hardware cost [5]. 
The structure of a node of the array is shown in Fig. 3 (b), where L is the LRU entry, 
M is the MRU entry, and index is the cache line access order kept by the array. The 
operation is illustrated in Fig. 3 (a). 
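
The bookkeeping that such an array performs can be modelled in software with an ordered list. The Python sketch below is a functional stand-in for one 4-way set, not the systolic array itself; the class and method names are hypothetical.

```python
class LruSet:
    """Access-order tracking for one cache set, from LRU (front) to MRU (back)."""

    def __init__(self, ways: int = 4) -> None:
        self.order = list(range(ways))   # order[0] is the LRU way

    def touch(self, way: int) -> None:
        """Record an access: the touched way becomes the MRU entry."""
        self.order.remove(way)
        self.order.append(way)

    def victim(self) -> int:
        """Return the way to replace on a miss, i.e. the LRU entry."""
        return self.order[0]
```

After touching ways 0 and 2 of a fresh set, the order becomes 1, 3, 0, 2, so way 1 is the next victim.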
The write buffer works with the data cache, which uses a write-through 
policy. On a write hit, the CPU writes both the data cache and the write buffer; on a write 
miss, the CPU writes into the write buffer only. Whenever there is a read miss, the write 
buffer is written back to the external memory before the cache line refill begins. On a read 
hit, the cache line in the data cache is the newest one, the same as the external memory. 
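
The write-through policy with a write buffer can be summarized in a few lines of code. The Python sketch below is a simplified illustration: it works at word granularity, ignores cache capacity and the 32-entry limit of the buffer, and uses hypothetical class and method names.

```python
class WriteThroughCache:
    """Word-granular sketch of a write-through data cache with a write buffer."""

    def __init__(self) -> None:
        self.cache = {}          # address -> data, stands in for cache lines
        self.write_buffer = []   # FIFO of (address, data) awaiting write-back
        self.memory = {}         # external memory

    def write(self, addr: int, data: int) -> None:
        if addr in self.cache:                  # write hit: update the cache too
            self.cache[addr] = data
        self.write_buffer.append((addr, data))  # hit or miss: always buffered

    def drain(self) -> None:
        """Write the buffered data back to external memory in order."""
        for addr, data in self.write_buffer:
            self.memory[addr] = data
        self.write_buffer.clear()

    def read(self, addr: int) -> int:
        if addr in self.cache:                  # read hit: cache holds newest data
            return self.cache[addr]
        self.drain()                            # read miss: drain before refill
        self.cache[addr] = self.memory.get(addr, 0)
        return self.cache[addr]
```

Draining the buffer before the refill is what keeps a read miss from fetching stale data out of external memory.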

The virtual address is 32 bits wide and the physical address is 26 bits wide. Virtual 
memory is divided into pages. There are four page sizes: 1MB, 64KB, 4KB and 2KB. 
Memory protection information is kept in the page table. The virtual-to-physical address 
translation is done by software. The TLB in the memory management unit acts as a 
cache of previous translation results, so a hit in the TLB accelerates 
address translation. 
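
How a virtual address splits into page number and offset for the four page sizes can be shown with a short calculation. The Python sketch below is illustrative only: the dictionary standing in for the TLB, the frame numbers and the function names are assumptions, and the page table format is not modelled.

```python
PAGE_SIZES = {"1MB": 1 << 20, "64KB": 1 << 16, "4KB": 1 << 12, "2KB": 1 << 11}
PHYS_BITS = 26   # width of the physical address


def split_virtual(va: int, page_size: int):
    """Split a 32-bit virtual address into (virtual page number, page offset)."""
    offset_bits = page_size.bit_length() - 1   # log2 of a power-of-two size
    return va >> offset_bits, va & (page_size - 1)


def translate(va: int, page_size: int, tlb: dict):
    """Return the physical address on a TLB hit, or None on a miss."""
    offset_bits = page_size.bit_length() - 1
    vpn, offset = split_virtual(va, page_size)
    frame = tlb.get(vpn)
    if frame is None:
        return None                            # miss: software must translate
    pa = (frame << offset_bits) | offset
    if pa >> PHYS_BITS:
        raise ValueError("frame number does not fit the 26-bit physical address")
    return pa
```

For a 4KB page the offset is 12 bits wide, which leaves 20 bits of virtual page number and 14 bits of physical frame number.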



2.2 Memory Controller 

Interfaced with off-chip memory, the memory controller has 8 chip selects, each of which 
is individually programmable. The memory types supported by TengYue-1's memory 
controller include SSRAM, SDRAM, FLASH and ROM. Because the controller is universal, 
the user can set the memory space mapped to the memory on each chip 
select and the timing sequences of the controller's interface by programming the chip select 
control registers and timing control registers to accommodate different memories. 

Fig. 4 shows the structure of the memory controller. The CPU bus interface and the 
memory interface are responsible for communication with the CPU and with memory 
respectively. The configuration registers contain information such as the types of 
memory, timing information and address mapping. All data paths and address paths 
are controlled by an FSM, which generates sequences of control signals according to the 
information stored in the configuration registers. The power-on configuration block 
latches the value of the memory data bus during reset, which determines the initial 
configuration of the memory controller and provides additional configuration bits for 
the system. The refresh counter and SDRAM control module are responsible for 
generating refresh cycle requests and timing sequences for the attached SDRAM. 






TengYue-1: A High Performance Embedded SoC 







Fig. 4. Architecture of memory controller 



The memory controller uses two clocks. One is the main clock of the system, 
i.e., the clock of the microprocessor; the other is derived from the main clock by 
dividing it by two. The two clocks are required to be synchronized, but it is hard to 
keep strict synchronization in the actual physical circuits. We avoid the metastability 
problem caused by signals crossing clock domains by using two-level 
synchronization, and this method works well in the implementation [4]. 



2.3 Peripherals 

The peripherals of TengYue-1 include an LCD controller, UART serial ports, SPI, I2C and 
GPIO interfaces, a programmable interrupt controller, a power management unit, and a 
hardware debug and test interface. All peripherals are connected to the 
microprocessor by the on-chip bus. The peripherals are mapped into the address 
space of main memory and accessed by the microprocessor through load/store 
instructions. 

The LCD controller can provide independent horizontal/vertical synchronization 
and combinational synchronization output signals. The screen size and the 
polarity of each video timing signal can be programmed by the user, thus providing 
compatibility with almost all available LCD displays. The LCD controller supports 
a number of color modes, including 32bpp (bits per pixel), 24bpp, 16bpp, 8bpp 
grayscale and 8bpp pseudo color. The color lookup table is inside the controller to 
reduce memory bandwidth demand. Video memory and color table are both double-banked, 
which can be used to reduce flicker and cluttered images. This feature is useful 
for applications such as video games and video streaming. 
TengYue-1 has a power manager with three modes: run mode, idle mode and sleep 
mode. These modes are used to reduce power consumption at times when some 
functions are not needed. When the chip is in either of the latter two modes, it can be 
woken by interrupts or software conditions. The interrupt controller can connect up 
to 32 external interrupts. The interrupt priority levels and response modes 
(level-sensitive or edge-sensitive) can be programmed by the user. The hardware debug and 
test interface helps in testing the chip and accelerates application development. 



2.4 On-Chip Bus 

The WISHBONE-compatible on-chip bus of TengYue-1 connects multiple master
devices to multiple slave devices. The microprocessor acts as a master device,
while the memory controller and the various peripherals are slave devices. The LCD
controller can access the memory directly through its own DMA channel, but it also
works as a slave device of the microprocessor, so it is both a master and a slave.
When a slave device has multiple master devices, their access order is decided by
the priority bit in the bus configuration register.
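
The priority-based access order can be illustrated with a small arbiter model. This is a sketch under assumptions: the paper does not describe the layout of the bus configuration register, so the single boolean `lcd_has_priority` and the master names are hypothetical stand-ins for the priority bit.

```python
def arbitrate(requests, lcd_has_priority):
    """Pick which master is granted the shared slave in this cycle.

    requests: set of master names currently requesting, e.g. {"cpu", "lcd"}.
    lcd_has_priority: hypothetical stand-in for the priority bit.
    """
    order = ["lcd", "cpu"] if lcd_has_priority else ["cpu", "lcd"]
    for master in order:
        if master in requests:
            return master
    return None  # nobody is requesting


print(arbitrate({"cpu", "lcd"}, lcd_has_priority=True))
```

A fixed-priority scheme like this one is simple, but the lower-priority master can be delayed whenever both request at once, which is the conflict studied in Section 4.2.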





Fig. 5. TengYue-1 test chip’s GDSII-based plot 



3 Implementation 

We use Synopsys Design Compiler for synthesis. The synthesis methodology is a
combination of top-down and bottom-up synthesis. The critical modules are
synthesized bottom-up, with constraints on their ports, paths, loads and fan-outs;
the other modules are synthesized top-down. In the physical design, Physical
Compiler is used to generate the cell placement. The data cache and instruction
cache locations are frozen during placement, which permits the floorplanning to be
data-driven. We generated the balanced clock tree with Cadence's CTGen. The
maximum simulated skew across the core was 125ps under worst-case process and
environmental conditions.

We fabricated TengYue-1 in SMIC's 0.18um CMOS technology. Fig. 5 shows the
GDSII-based plot of the test chip. The clock frequency reaches 300MHz at 1.8V;
power dissipation is 0.47mW/MHz. The total die size is 4.9mm x 4.9mm. The
microprocessor core, including the instruction cache and data cache, occupies
4.73mm^2.
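
The figures above imply a few derived quantities, worked out below. This is simple arithmetic on the reported numbers; it assumes that power scales linearly with frequency up to 300MHz, which the paper does not state explicitly.

```python
freq_mhz = 300            # reported maximum clock frequency
power_per_mhz_mw = 0.47   # reported power dissipation per MHz
die_side_mm = 4.9         # reported die edge length
core_area_mm2 = 4.73      # reported core area, caches included

total_power_mw = freq_mhz * power_per_mhz_mw   # about 141 mW (linear scaling assumed)
die_area_mm2 = die_side_mm ** 2                # about 24.01 mm^2
core_share = core_area_mm2 / die_area_mm2      # about 0.197 of the die

print(round(total_power_mw, 1), round(die_area_mm2, 2), round(core_share, 3))
```

On these numbers the core and its caches take roughly a fifth of the die; the remainder holds the peripherals, bus, pads and other logic.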



4 Performance Evaluation 

4.1 Microprocessor Core Performance 

Analysis of the results for 9 ARM benchmarks showed that the microprocessor core
achieved an IPC of about 0.64. All the benchmark programs are written in C and
compiled with gcc. Fig. 6 shows the CPI of each program.







Fig. 6. IPC of benchmarks 



4.2 Study of Memory Access Conflict Problem 

Because the video memory of the LCD resides in main memory, the memory controller
has two master devices: the microprocessor and the LCD controller. Memory accesses
from the LCD controller and the microprocessor can therefore conflict. If the LCD
controller occupies most of the memory bus cycles, the performance of programs
running on the CPU, including the LCD driver, may be affected; the LCD driver, for
instance, may be unable to write display information in time. One solution to this
problem is to isolate the video memory of the LCD from main memory. The cost is an
additional memory controller and a more complex on-chip bus.

In our design, the microprocessor core and the LCD controller share the memory
controller. To show that this method can satisfy the performance requirements of
the system, we built a queue model of the problem, as shown in Fig. 7.

Fig. 7. Queue model

Fig. 8. Simulation result (CPU average memory access delay)

Queue 1 contains the memory access requests from the LCD controller. Because an
LCD requires pixel values to be put on the screen at a fixed frequency, we assume
that video memory access requests arrive at a fixed interval. Suppose the display
period of the selected LCD is 156.25ns [7]. With a microprocessor clock period of
5ns (200MHz), a 32-bit word must be read from memory every 31.25 cycles in 32 bpp
mode and every 62.5 cycles in 16 bpp mode; the intervals for the 24 bpp and 8 bpp
modes can be deduced in the same way.
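
The fetch intervals quoted above follow from dividing the time covered by one 32-bit word by the clock period. The sketch below reproduces the two stated values and derives the other two. It assumes tightly packed pixels; a controller that stores each 24-bit pixel in its own 32-bit word would instead behave like the 32 bpp case. The function name is an illustrative choice.

```python
def cycles_per_word(bpp, pixel_period_ns=156.25, clock_period_ns=5.0, word_bits=32):
    """Clock cycles between successive 32-bit video memory reads."""
    pixels_per_word = word_bits / bpp          # packed pixels (assumption)
    return pixels_per_word * pixel_period_ns / clock_period_ns


for bpp in (32, 24, 16, 8):
    print(bpp, "bpp:", round(cycles_per_word(bpp), 2), "cycles per word")
```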

Queue 2 is for the microprocessor. We assume that the arrival interval of memory
access requests from the microprocessor follows an exponential distribution.

The memory controller acts as the server and the service time is a fixed value: 10
cycles without burst mode and 13 cycles with burst mode [6]. The burst mode returns
four 32-bit words on one memory access while the non-burst mode returns just one.
We assume that the LCD controller has enough buffers to cache pixel data, that the
microprocessor does not use burst mode, and that the LCD has higher priority when
a conflict occurs.

We built a queue model in GPSS according to the previous assumptions and
simulated the memory access behavior of the LCD controller and the microprocessor
under different LCD display modes [8]. We studied the effect of memory access
conflicts on system performance. Fig. 8 shows the simulation result.
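
The queue model can be approximated with a few lines of Python. This is a sketch, not the GPSS model used in the paper: it is a single-server, non-preemptive priority queue with fixed-interval LCD requests and exponentially distributed CPU requests. The mean CPU request gap of 50 cycles, the sample count and the seed are assumptions, since the paper does not report the CPU request rate. Burst mode is modelled as one 13-cycle LCD access per four words, i.e. every 125 cycles at 32 bpp.

```python
import random


def simulate(lcd_period, lcd_service, cpu_mean_gap=50.0, cpu_service=10.0,
             n_cpu=20000, seed=1):
    """Return (mean CPU wait in cycles, fraction of CPU requests delayed)."""
    rng = random.Random(seed)
    now = 0.0                                   # time at which the server is next free
    lcd_next = 0.0                              # arrival time of the next LCD request
    cpu_next = rng.expovariate(1.0 / cpu_mean_gap)
    total_wait, delayed, served = 0.0, 0, 0
    while served < n_cpu:
        if lcd_next > now and cpu_next > now:
            now = min(lcd_next, cpu_next)       # server idles until a request arrives
        if lcd_next <= now:                     # LCD wins whenever it is waiting
            now += lcd_service
            lcd_next += lcd_period
        else:                                   # otherwise serve the waiting CPU request
            wait = now - cpu_next
            total_wait += wait
            delayed += wait > 0
            now += cpu_service
            cpu_next += rng.expovariate(1.0 / cpu_mean_gap)
            served += 1
    return total_wait / n_cpu, delayed / n_cpu


non_burst = simulate(lcd_period=31.25, lcd_service=10.0)   # 32 bpp, one word per access
burst = simulate(lcd_period=125.0, lcd_service=13.0)       # 32 bpp, four words per access
print("non-burst:", non_burst)
print("burst:    ", burst)
```

Note that the delay measured here also includes CPU requests queuing behind each other, so the absolute numbers are not comparable with Fig. 8; only the direction of the effect is.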

From Fig. 8 we can conclude that the LCD controller considerably impairs the
memory accesses of the microprocessor when it does not use burst mode. In the
worst case, about 33% of the microprocessor's load/store instructions are delayed,
by an average of 11 cycles. When burst mode is enabled, however, the memory access
delay of the microprocessor is reduced to 1-2 cycles, and only 8% of all
load/store instructions are affected in the worst case. The cost is only three
additional cycles in the memory access time of the LCD controller.
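
A back-of-envelope check helps explain why burst mode matters. The share of time the memory controller spends serving the LCD is the service time divided by the interval between LCD accesses. The sketch below uses the 32 bpp interval of 31.25 cycles per word and assumes that a burst access replaces four single-word accesses; it is an estimate, not a substitute for the simulation.

```python
def lcd_utilisation(service_cycles, words_per_access, cycles_per_word=31.25):
    """Fraction of memory controller time taken by LCD accesses."""
    return service_cycles / (words_per_access * cycles_per_word)


non_burst_share = lcd_utilisation(10, 1)   # 10 cycles every 31.25 cycles
burst_share = lcd_utilisation(13, 4)       # 13 cycles every 125 cycles
print(round(non_burst_share, 3), round(burst_share, 3))
```

These shares, about 32% and about 10%, are roughly in line with the simulated fractions of delayed load/store instructions (33% and 8%), although those figures come from the simulation and not from this formula.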

Thus it can be seen that mapping the video memory to main memory and sharing the
memory controller between the LCD controller and the microprocessor can satisfy
the performance requirement. The impact of the LCD controller on the memory
accesses of the microprocessor can be reduced, and even ignored, when the burst
mode of the memory is used.



5 Conclusions 

TengYue-1 is a high performance embedded SoC design. We studied the critical
issues of the design and improved the overall performance by efficiently reducing
the complexity of the hardware. The chip has been implemented and verified in
SMIC 0.18um CMOS technology, and its frequency reaches 300MHz at 1.8V.
TengYue-1 has broad application prospects in RS encoding and decoding, information
encryption and decryption, and safety authentication.



Abstract. The microprocessor is a crucial component of a reliable system. With
improvement in semiconductor manufacturing, more and more transistors may
be integrated into a single chip with increased potential detriment to
dependability. Fault-tolerant single-chip multiprocessors offer an ideal
architecture for achieving high availability while maintaining high performance.
The design of a fault-tolerant single-chip multiprocessor is described, from
hardware redundancy to software support and firmware information strategies.
The design aims at masking the influences of errors and automatically
correcting system states, which differs from traditional approaches which
mainly target errors in the memory and I/O subsystems. Dynamic recovery and
reconfiguration are also described to provide adequate protection from
catastrophic failure of the system.



1 Introduction 

The growing dependence on computer systems demands microprocessors which
provide higher dependability whilst maintaining high computing performance. This is
particularly true for mission-critical applications. With the development of
deep-submicron technology, it is predicted that in the next 10 years a single chip
can contain more than one billion transistors [1]. However, shrinking geometries,
lower power voltages and high frequencies also have a negative impact on
dependability.

Recently, as the trend toward thread-level parallelism matures, single-chip
multiprocessors (CMPs) present a promising solution to partly mitigate these
influences [2-4]. CMPs, which integrate multiple processors into a single chip,
execute several threads concurrently to achieve high computing performance. The
advantages of this technique include simplified design of critical paths and
reduced development time and cost.

From an architectural viewpoint, CMPs combined with fault-tolerant techniques 
can further improve microprocessor dependability. Such designs follow two reliability 
principles. First, they observe the classical maxim: “simple is reliable”. A design 
achieves this goal by incorporating simple control logic and replacing the traditional 
complex parallel structures with multiple simple processors. Second, CMPs, which 
may use multiple identical processors for error detection and recovery, have inherent 
features leading to flexible fault-tolerant architectures. Different fault-tolerant 
strategies can be implemented neatly by reasonably dispatching available redundant 
components. 

In this paper, we propose an architecture for a fault-tolerant single-chip
multiprocessor (F-CMP). Unlike the IBM pSeries 690 [5], which pursues RAS and can
tolerate the duration of a repair, F-CMP provides high levels of reliability and
availability with strong automatic recovery capability. The architecture is
configurable and supports the replacement of faulty components and the
degeneration to lower reliability levels when uncorrectable errors occur. Users
can also specify a fault-tolerant mode corresponding to the dependability
requirements of the application. The fault-tolerant strategies have been designed
to trade off between hardware and software.

The rest of the paper is organized as follows. Section 2 presents an overview of
the F-CMP architecture. Section 3 discusses the fault-tolerant design techniques
used in F-CMP, including hardware redundancy, firmware and software support.
Section 4 describes the recovery strategies of F-CMP and Section 5 concludes.



2 System Overview 

The F-CMP architecture is based on Tsinghua University’s Thump-107. The
Thump-107 is a RISC-based microprocessor targeting embedded applications. Its
instruction set is a superset of MIPS-4Kc and is compatible with the MIPS 32-bit
RISC architecture. It has a 4k-byte instruction cache, a 4k-byte data cache and a
7-stage pipeline structure able to execute up to seven instructions per clock
cycle.

From an architectural point of view, F-CMP is a closely coupled multiprocessor
that contains four identical Thump-107 processors, a shared cache and the control
logic needed to realize the fault-tolerant strategies. A logical overview is
shown in Fig. 1.

F-CMP has six kinds of functional units: 

• Four identical Thump-107 processors

The Thump-107 processor is an independent 32-bit CPU core, which implements
the MIPS 4Kc instruction set plus four instructions for multimedia applications. In
F-CMP, the processors reuse the basic Thump-107 structure, adding two instructions
for implementing the synchronization primitives used by a standard CMP: load
locked (LL) and store conditional (SC). Each processor also includes an 8k-byte
instruction cache, an 8k-byte data cache and the corresponding control logic. The
L1 cache is a write-through primary cache that allows all processors to snoop on
all writes performed.

• Fault Handling Mechanism 

A fault handling mechanism consisting of a crossbar and four sets of fault-tolerant 
selectable logic was designed to detect computing errors by comparing results from 
independent processors. Logically, the crossbar controlled by the centralized 
arbitration controller connects the outputs of four identical processors with select 
logic. Four fault-tolerant computing modes are provided: one-mode, dual-mode, 
triple-mode and quadruple-mode redundancy, named after the numbers of 
participating processors. 

Fig. 1. Logical overview of F-CMP architecture



• A shared secondary cache 

Four processors share a unified L2 cache organized as two cache banks with 
separate controllers. The L2 cache can protect data with standard CRC words and its 
capacity is as large as 2M-bytes. In the non-fault-tolerant mode, the shared cache can 
be organized as a unified interleaved cache. In the fault-tolerant mode, the L2 cache 
consists of two independent sub-caches, each of which backs up the other. The main 
cache bank and the backup bank make up a fault-tolerant memory subsystem. 

• A centralized arbitration controller 

Besides controlling the crossbar to provide fault tolerance, the centralized
arbitration controller manages access privileges on the shared bus. In F-CMP, three
logical data buses are used for data transmission: a write-through bus, a
read/replace bus and a backup bus. The backup bus is a standard data bus and may
take over the tasks of the other buses if necessary.

• Main memory interface (MIU) 

The MIU handles all the interfacing transactions from and to F-CMP, including 
main memory accesses and external snoop processing. 

• I/O interface unit 

The I/O interface unit handles all the input and output transactions of F-CMP. 



3 Fault-Tolerant Design Techniques

F-CMP provides different levels of fault tolerance, including hardware redundancy, 
software support and firmware-based recovery. The architecture also allows system 
reconfiguration and dynamic graceful degeneration. 



3.1 Fault-Tolerant Hardware Design 

F-CMP provides some special structures to satisfy the requirements of different levels 
of fault tolerance. Here, three fault-tolerant strategies are presented: 

• Fault-tolerant Computing 

To achieve high reliability in the course of computing, three fault-tolerant 
strategies based on comparison techniques are provided in F-CMP. Fig. 2 shows the 
logic structure of different redundancy modes. 



OMR: One-Mode Redundancy; DMR: Dual-Mode Redundancy;
TMR: Triple-Mode Redundancy; QMR: Quadruple-Mode Redundancy




Fig. 2. Logical structure of fault handling mechanisms 



In the fault handling mechanism, there are four sets of fault-tolerant logic. 
However, at any time, only one set of logic is activated. 

One-mode redundancy represents a standard single-chip multiprocessor, in which
every processor is an autonomous subsystem running applications independently. It is
a non-fault-tolerant mode. In dual-mode redundancy, two processors form a simple
comparison-based fault-tolerant subsystem while the others run as two independent
single-processor subsystems. Similarly, triple-mode redundancy includes an
autonomous single-processor subsystem and a voting fault-tolerant subsystem of three
processors. The most complex form, quadruple-mode redundancy, uses all four
processors for fault-tolerant computing. In this mode, each pair of processors forms
a simple comparison-based fault-tolerant subsystem, and the outputs of the two pairs
are compared once more to produce a single result.
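
The comparison and voting behaviour of the three fault-tolerant modes can be sketched as follows. This is an illustration and not the F-CMP selection logic: the function names are hypothetical, and the handling of a quadruple-mode run in which only one pair agrees internally is an assumption, since the text only says that the pair outputs are compared once more.

```python
from collections import Counter


def dmr(a, b):
    """Dual-mode: comparison detects a mismatch but cannot tell which side is wrong."""
    return (a, True) if a == b else (None, False)


def tmr(a, b, c):
    """Triple-mode: a majority vote masks one faulty processor."""
    value, votes = Counter([a, b, c]).most_common(1)[0]
    return (value, True) if votes >= 2 else (None, False)


def qmr(a, b, c, d):
    """Quadruple-mode: two compared pairs, then a comparison of the pair results."""
    first, ok1 = dmr(a, b)
    second, ok2 = dmr(c, d)
    if ok1 and ok2:
        return dmr(first, second)
    # Assumption: if exactly one pair agrees internally, its value is used.
    if ok1:
        return (first, True)
    if ok2:
        return (second, True)
    return (None, False)


print(dmr(5, 6), tmr(5, 5, 6), qmr(5, 6, 7, 7))
```

Each function returns a `(value, ok)` pair, so a caller can distinguish a masked error from one that was detected but could not be corrected.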

The fault-tolerant modes are controlled by an error-capture register (ECR).
Depending on how the outputs of the processors are connected, one-mode,
dual-mode, triple-mode and quadruple-mode redundancy require 1 bit, 6 bits, 4 bits
and 3 bits, respectively, in the ECR. As a result, the ECR has 14 controlling bits
altogether, each of which corresponds to a computing mode.
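
One reading that is consistent with these bit counts is that each bit selects one way of grouping the four processors: 6 ways to choose a compared pair, 4 ways to choose a voting triple, and 3 ways to split the four processors into two pairs. The paper does not state this derivation, so the enumeration below is an interpretation and not a description of the actual register layout.

```python
from itertools import combinations

processors = (0, 1, 2, 3)

dual = list(combinations(processors, 2))     # which two processors are compared
triple = list(combinations(processors, 3))   # which three processors vote
# Splits of four processors into two unordered pairs: choose processor 0's partner.
quad = [((0, p), tuple(q for q in processors if q not in (0, p))) for p in (1, 2, 3)]

counts = {"one": 1, "dual": len(dual), "triple": len(triple), "quad": len(quad)}
print(counts, "total bits:", sum(counts.values()))
```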

• Reliable Data Transmission

Connecting the processors, secondary cache and other control interfaces together 
are a read/replace bus, a write-through bus and a backup bus, along with address and 
control buses. While the read/replace and write-through buses are virtual buses, the 
physical wires are divided into multiple segments using repeaters and pipeline 
buffers. The backup bus is an independent physical bus, which implements standard 
reads and writes. Thus it can be used to replace the other two logical buses. As a 
result, the bus structure is actually a dual-bus system which protects transmitted 
information. 

The read/replace bus acts as a general-purpose system bus for moving data
between the processors, secondary cache and external interface to off-chip memory.
The processors can simultaneously fetch required data from the secondary cache via
the read bus. Data from the external interface can be broadcast to both of the
processors while it is being sent to the secondary cache.

The write-through bus permits F-CMP to use a write-update coherence protocol to 
maintain coherent primary caches. Data exchange among the processors is carried out 
via a write-through bus under the control of the centralized bus arbiter. When one 
processor modifies data shared by other processors, write broadcast over the bus 
updates all copies while the permanent machine state is written back to the secondary 
cache. 

The backup bus design has two objectives. First, it can be used as an independent 
read and write bus to speed up communications between two levels of cache. Second, 
it can also be used as the single data bus when uncorrectable failures are detected on 
the other logical buses. 

As in other high reliability microprocessor systems, data on the bus is augmented
by a single-error-correct and double-error-detect Hamming ECC to further enhance
fault tolerance.
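
The single-error-correct, double-error-detect (SEC-DED) behaviour can be shown with the smallest extended Hamming code, which protects 4 data bits with 4 check bits. This is a toy sketch: the width and bit layout of the ECC actually used on the F-CMP buses are not given in the paper, and the function names are illustrative.

```python
def encode(d):
    """Encode 4 data bits as 8 code bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]          # covers positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]          # covers positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]          # covers positions with bit 2 set
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c


def decode(c):
    """Return (data, status); status is 'ok', 'corrected' or 'double'."""
    c = list(c)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos            # XOR of set positions is 0 for a valid word
    overall = 0
    for bit in c:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1               # single error; syndrome 0 is the parity bit itself
        status = "corrected"
    else:
        return None, "double"          # even number of flips with a nonzero syndrome
    return [c[3], c[5], c[6], c[7]], status


word = encode([1, 0, 1, 1])
word[6] ^= 1                           # inject a single-bit error
print(decode(word))
```

The overall parity bit is what separates the two cases: a single flip makes the overall parity odd, while a double flip leaves it even but still produces a nonzero syndrome.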

• Fault-tolerant Data Store 

Both L1 and L2 caches have memory protection mechanisms designed to correct
single event errors. Several bits are added to the banks to replace faulty bits. These
techniques are analogous to the programmed steering logic in the structure of
POWER4 [6]. The controlling logic is activated by built-in self-test at power on.

As mentioned above, besides operating as a normal unified cache, the two-bank L2
cache can also implement a fault-tolerant memory subsystem by configuring one bank
to back up the other. As a result, F-CMP may run in multiple storage modes
according to the reliability states of the caches. Possible storage configurations
are shown in Table 1, where ‘√’ and ‘×’ represent the availability and
non-availability of the corresponding components, respectively.



Table 1. F-CMP Storage Configuration

Mode   L1 cache   L2 cache: Main Bank   L2 cache: Backup Bank
1      √          √                     √
2      ×          √                     √
3      √          ×                     √
4      √          √                     ×
5      ×          ×                     √
6      ×          √                     ×
7      √          ×                     ×
8      ×          ×                     ×

3.2 Firmware 

Firmware was designed to record runtime error thresholds, indicating the number of 
corrected errors in the F-CMP microprocessor. The error information recorded in 
firmware becomes part of a system error log and is used for system reconfiguration. 
The replaceable components include the processors, cache banks and buses. When 
errors are detected in these components, error information is recorded in firmware. 

Dynamic reconfiguration is implemented: firmware logs errors and is used to
decide the current fault-tolerant mode. Hardware and firmware periodically check
whether error counts stay below a threshold. Once this threshold is exceeded, the
system initiates additional runtime availability actions, such as a controlled
shutdown of processors, the replacement of a faulty cache line, or a controlled
shutdown of banks or even the whole L2 cache.
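
The threshold check described above can be sketched as a per-component counter. The class name, the component labels and the single shared threshold are all assumptions made for illustration; the paper does not describe the firmware data structures or the threshold values.

```python
class ErrorLog:
    """Counts corrected errors per replaceable component (hypothetical structure)."""

    def __init__(self, threshold):
        self.threshold = threshold      # assumed to be the same for every component
        self.corrected = {}             # component name -> corrected error count

    def record(self, component):
        """Log one corrected error; return True once the threshold is exceeded."""
        self.corrected[component] = self.corrected.get(component, 0) + 1
        return self.corrected[component] > self.threshold


log = ErrorLog(threshold=2)
actions = [log.record("l2_main_bank") for _ in range(3)]
print(actions)
```

A `True` result would be the point at which the system starts one of the availability actions listed above, such as shutting down the affected bank.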



3.3 Software Support 

Relying on hardware redundancy alone is not enough to achieve high reliability,
and software support plays a very important role in the design of F-CMP. Since
each processor in F-CMP has its own program counter and register file, it is easy
to execute multiple instruction streams in parallel. Four independent instruction
streams, generally called threads, can be issued simultaneously to the different
processors to achieve software implemented fault tolerance (SIFT). The operations
are scheduled under the control of the operating system.

As mentioned above, the fault-tolerant modes of F-CMP are controlled by an
error-capture register (ECR). This register is software-readable and can be reset
by the operating system. At bootstrap, the operating system may set the
fault-tolerant mode of F-CMP according to application requirements and the current
state of the replaceable components in the microprocessor. Hardware and software
may cooperate to implement runtime fault-tolerant handling, which is transparent
to the users. F-CMP may also degenerate into a lower fault-tolerant mode if
uncorrectable failures are encountered.



4 System Recovery Strategies 



F-CMP implements concurrent error detection, dynamic fault isolation and error 
recovery while running. The most important principle of system recovery strategies is 
to guarantee the dependability of mission-critical applications even if system 
performance must be lowered. As a result, the replaceable components are considered 
first to take over from failed ones during the course of system reconfiguration. The 
error recovery flowchart of F-CMP is shown in Fig. 3. 




Fig. 3. Error recovery flowchart of F-CMP 

When the number of errors exceeds the capability of the current fault-tolerant
mode, F-CMP may commence system degeneration. Corresponding to the fault-tolerant
levels, there are three kinds of system degeneration, at the level of the
processors, the buses and the caches. Processor-level degeneration has four forms:
from quadruple-mode to dual-mode or one-mode redundancy, from triple-mode to
one-mode redundancy, and from dual-mode to one-mode redundancy. Bus-level
degeneration may go from dual-mode to one-mode redundancy. In cache-level
degeneration, individual cache lines, banks and even whole caches may be removed
as uncorrectable errors increase.
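
The four processor-level degeneration forms can be written down as a small transition table, which makes it easy to check whether a requested step is one of the listed forms. The dictionary and function names are illustrative and not taken from the paper.

```python
# Processor-level degeneration paths as listed in the text.
DEGENERATION = {
    "quadruple": {"dual", "one"},
    "triple": {"one"},
    "dual": {"one"},
    "one": set(),          # already the non-fault-tolerant mode
}


def can_degenerate(src, dst):
    """True if the text lists a degeneration from mode src to mode dst."""
    return dst in DEGENERATION.get(src, set())


n_forms = sum(len(targets) for targets in DEGENERATION.values())
print(n_forms, can_degenerate("quadruple", "dual"), can_degenerate("triple", "dual"))
```

Note that triple-mode to dual-mode is not among the listed degeneration forms.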

Besides system degeneration, F-CMP has dynamic reconfiguration capabilities,
viz. state conversion between different fault-tolerant modes. The dual-mode,
triple-mode and quadruple-mode redundancy have different dependability
characteristics and complement each other. For applications requiring more processor
power, dual-mode is suitable because the two groups of processors may execute
different threads individually. For other critical applications, triple-mode or
quadruple-mode may be much better. System reconfiguration controlled by software
takes two forms: conversion between dual-mode and triple-mode redundancy, and
conversion between quadruple-mode and triple-mode redundancy.



5 Conclusion 

We have described a fault-tolerant single-chip multiprocessor aimed at providing
adequate protection from system failure. The design has configurable levels of
hardware redundancy, software support and firmware information techniques. The
architecture provides multiple fault-tolerant modes and is thus adaptable to different
mission-critical applications; it also allows smooth conversion between the different
modes. Since all the redundancy components and corresponding controllers follow
the “simple design” principle, it is easy to validate the reliability of subsystems,
and this leads to a highly dependable system.

To verify the properties of the architecture, two related software tools have been
written. One is a behavior simulator written in C, which performs functional
validation. The simulator can simulate the operational behavior of F-CMP
cycle by cycle. The other tool is a dynamic runtime error injection system, which
randomly injects permanent, intermittent and transient errors into the system.
Initial validation tests showed that the F-CMP architecture improved system
dependability effectively for some mission-critical applications and reached the
initial dependability design target.



Abstract. This paper is a first look at the value of the RAMpage memory
hierarchy to low-energy design. The approach used, dreamy memory,
is to put DRAM in a low-power mode, unless it is referenced. Simulation
results show that RAMpage provides a better overall speed-energy compromise
than the conventional architecture used for comparison. The
most energy-efficient RAMpage configuration in dreamy mode ran 3%
faster and used 71% of the energy for DRAM of the best dreamy run
of the conventional model. As compared with the best non-dreamy run
time, the best dreamy time was 9% slower, but used under 17% of the
energy for DRAM. The lowest-energy dreamy simulation used less than
16% of the DRAM energy of the fastest non-dreamy version, a very useful
gain, given that DRAM uses significantly more power than the processor
in a low-energy design. The most energy-efficient variant ran 12% slower
than the fastest, allowing several trade-offs between speed and energy.



1 Introduction 

The RAMpage memory hierarchy moves main memory up a level to replace the 
lowest-level cache with an SRAM main memory, while DRAM becomes a paging 
device. Previous work has shown RAMpage to be a potentially viable design in 
terms of hardware-software trade-offs [16] and that it scales better as the CPU- 
DRAM speed gap grows, particularly when taking context switches on misses to 
DRAM [14]. In this paper, the value of RAMpage in hiding DRAM latency is 
further explored by introducing the idea of dreamy memory. 

Dreamy memory is kept in a low-power mode unless it is referenced. While 
waking the memory up incurs significant overhead, RAMpage could hide this 
overhead as has previously been demonstrated. 

In desktop and server designs, with processor power consumption on the 
order of tens of watts or even over 100W, reducing memory power usage is not a 
major issue. However, with a low-energy design, DRAM energy usage becomes 
significant. A 128Mbyte DRAM as simulated in this study uses about 0.5W, as 
compared with a 500MHz processor of the ARM 11 family [1], which uses about 
0.2W. A small mobile device with a relatively modest memory therefore has to 
allocate a significant fraction of its energy budget to DRAM. 



P.-C. Yew and J. Xue (Eds.): ACSAC 2004, LNCS 3189, pp. 146-159, 2004.
© Springer-Verlag Berlin Heidelberg 2004

In this paper, the approach investigated is to use the self-refresh mode
commonly available in double-data rate synchronous DRAM (DDR-SDRAM), which
allows DRAM contents to be maintained with 1% of normal power [17], to
implement dreamy memory. Simulations are based on parameters suited to a mobile
device. The aim is to reduce DRAM energy usage to as close as possible to that
of self-refresh mode, with performance as close as possible to that of full-power
mode.
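
The potential saving can be illustrated with a first-order energy model: DRAM draws full power while it is being referenced and 1% of that power in self-refresh mode. The model below is a sketch that ignores wake-up latency and wake-up energy, and the 10-second run with DRAM active 10% of the time is an invented example, not a result from this paper. The 0.5W figure is the DRAM power quoted in the introduction.

```python
def dram_energy_joules(active_s, idle_s, full_power_w=0.5, self_refresh_fraction=0.01):
    """Energy if DRAM runs at full power when referenced and self-refreshes otherwise."""
    return full_power_w * (active_s + self_refresh_fraction * idle_s)


always_on = dram_energy_joules(active_s=10.0, idle_s=0.0)   # never enters self-refresh
dreamy = dram_energy_joules(active_s=1.0, idle_s=9.0)       # asleep 90% of the time
print(always_on, dreamy, round(dreamy / always_on, 3))
```

In this invented example the dreamy configuration uses about 11% of the always-on energy; how close a real system gets to that bound depends on how well the wake-up overhead is hidden.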

The remainder of this paper is structured as follows. Section 2 presents more 
detail of the RAMpage hierarchy and related research. Section 3 explains the 
experimental approach, while Section 4 presents experimental results. In conclu- 
sion, Section 5 summarizes the findings and outlines future work. 



2 Background 

2.1 Introduction 

RAMpage was proposed [12] in response to the memory wall [21,9], which arises 
mainly with high-end systems, where processor improvements have not been 
matched by DRAM speed improvements. At the low end, energy use is a much 
more significant problem. RAMpage’s ability to hide latency of (relatively) slow 
DRAM can potentially be used to hide the latency of waking a DRAM up from 
a low-power mode. 

In this paper, low energy, rather than low power is the measure of interest, as 
we are concerned with total energy use over time, rather than an instantaneous 
measure. 

The remainder of this section briefly surveys other approaches to low-energy
memory design, followed by an outline of the RAMpage approach to the problem.



2.2 Low-Energy Memory Design 

There have been several approaches to reducing the energy needs of memory. 

IRAM (Intelligent RAM) was originally proposed to address the memory wall
problem, by implementing a large DRAM on-chip with the processor, instead of
the traditional trend of increasing on-chip cache size. While the on-chip DRAM
is slower than an SRAM cache, it is faster than an off-chip DRAM [19]. More
recently, IRAM has been shown to offer the potential for reduced energy usage,
because of DRAM’s lower energy requirement as compared to SRAM, and
elimination of off-chip buses [6].

At the low end, work has been done on variations on memory organization
like multiple banks (less commonly used banks can be put in low-power modes),
finding optimum combinations of number of banks and bus width, and exploring
compromises between performance-optimal and energy-optimal organization of
caches and DRAM [2]. One specific proposal for a low-energy design for
system-on-chip (SoC) applications is to organize static RAM into statically
allocated banks, based on predicted data referencing behaviour [4]. The main
problem with this approach is that it requires static allocation, and does not
allow for changes in the relative sizes of the banks for different workloads.

The closest ideas to that reported here are Power-Aware DRAM (PADRAM)
[11] and Power-Aware Virtual Memory (PAVM) [8]: page placement is used in
a memory in which different chips may be in different power modes. Frequently
accessed pages are in a DRAM which is not in a low-power mode (or is in one
less often than other chips).

In a PADRAM study, it was shown that putting all DRAM into the lowest-power
mode resulted in an execution time of 2 to 60 times that of full-power mode,
whereas a dynamic policy resulted in a relatively small speed loss, with
significant energy saving. While various details of the PADRAM study differed from
those reported in this paper (faster processor, smaller L2 cache, Rambus memory
with higher wakeup latency), the most significant difference is that no operating
system effects were modelled: single process execution times were reported, not a
mix of workloads [11]. In addition, only a hardware-managed L2 cache was
modelled, not a software-managed cache like the RAMpage SRAM main memory.
RAMpage, especially with context switches on misses, relies on a
multiprogramming workload to hide DRAM latency and is therefore able to get away
with a simpler approach to managing DRAM.

PAVM has been investigated in more detail, using an actual implementation
on Linux and an otherwise-conventional memory hierarchy. It exploits
a combination of the different modes available in Rambus and dynamic page
placement strategies, with DRAM energy savings of up to 59% with a heavy
workload [8].

Since other low-energy techniques can apply to dreamy DRAM, approaches 
in areas such as reducing energy to drive a bus to DRAM [20] and reducing cache 
energy [10] have not been considered in detail as potential competing work. 

2.3 The RAMpage Approach 

RAMpage makes as few changes from a traditional hierarchy as possible. The
lowest-level cache becomes the main memory (i.e., a paged virtually-addressed
memory), with disk used as a secondary paging device. The RAMpage main
memory page table is inverted, to minimize its size. Further, an inverted page
table has another benefit: no TLB miss can result in a DRAM reference, unless
the reference causing the TLB lookup is not in any of the SRAM layers [16].

RAMpage has in the past been shown to scale well in the face of the growing
CPU-DRAM speed gap, particularly when context switches are taken on misses.
The effect of taking context switches on misses is that, if other work is available
for the CPU, waiting for DRAM can effectively be eliminated [14]. Performance
characteristics of RAMpage have previously been reported [16, 13, 14]. For
purposes of this paper, the key advantage of RAMpage is the ability to mask latency
of DRAM references, with the aim of keeping DRAM in a low-power mode unless
it is being referenced, without significant loss of speed.

Compared with most other approaches to low-energy memory systems, the 
RAMpage approach is very simple. No special hardware is needed, other than 
the RAMpage design itself. DRAM is put into a low-power mode, and turned 
on when it is referenced. Compared with the PADRAM approach, the architecture 
requires no complex dynamic placement strategy. Provided a process is 
ready to run on a miss to DRAM, the extra wake-up latency can be masked. 
PAVM is closer in philosophy, but RAMpage carries the idea further in man- 
aging the lowest-level cache in software, which has potential for other wins, as 
described in previous RAMpage work [16,14]. 

The dynamic placement strategies of PADRAM and PAVM could be added 
to RAMpage, combining their benefits with a software-controlled SRAM main 
memory. 

3 Experimental Approach 

3.1 Introduction 

This section outlines the approach to the reported experiments. Results are de- 
signed to be comparable to previously reported results as far as possible. The 
simulation strategy is explained, followed by some detail of simulation parame- 
ters; in conclusion, expected findings are discussed. 



3.2 Simulation Strategy 

The approach followed here is similar to that used in previously reported work. 
However, the processor speed characteristics are based on the ARM11 series [1] 
running at 500MHz. This processor consumes 0.2W at this speed; this power 
consumption makes the power needs of DRAM significant. 

Simulations are trace-driven, and do not model the pipeline. It is assumed 
that pipeline timing is less significant than variations in DRAM referencing. 
Given that the ARM11 family only issues one instruction per clock and has 
accurate branch prediction, this approach to simulation is unlikely to introduce 
significant inaccuracies. For simplicity, the simulations do not use all features 
of the ARM11 series. The ARM11’s two-level TLB is not simulated. Instead, 
a relatively small 1-level TLB is simulated. The RAMpage hierarchy is more 
disadvantaged by this approximation than a conventional hierarchy, since it relies 
on the TLB for mapping pages in the SRAM main memory, rather than in 
DRAM [15]. 

A standard 2-level hierarchy is compared to a similar version of a RAMpage 
hierarchy, with and without context switches on misses. RAMpage without con- 
text switches on misses is intended to convey the effects of adding associativity 
(with an operating system-style replacement strategy). Adding context switches 
on misses shows the value of having alternative work on a miss to DRAM. In all 
cases, the effect of running with DRAM permanently on is compared with the 
effect of running with DRAM in self-refresh mode, except when it is referenced. 

In this study, given that energy and cost are more significant than for
previous studies, L2 is reduced from 4 Mbytes to 1 Mbyte (the simulated 1MB SRAM 
consumes 0.8W [18]; 4MB would use 3.2W, significant compared with a 0.2W 
processor). This reduction disadvantages RAMpage more than the standard
hierarchy, because part of the SRAM main memory is reserved for operating
system data and code: in addition to a page table, RAMpage reserves 32 Kbytes
for the operating system. The ARM11 series includes 64 Kbytes of SRAM (tightly
coupled memory, TCM) which could be used for the operating system in RAMpage; the 
page table would also fit for SRAM page sizes of 256 bytes or more. This option 
was not explored in this study; using TCM, which operates at cache speed, in 
RAMpage simulations would likely result in significant speed gains. 



3.3 Simulation Parameters 

The processor modelled in this paper is slower than in recent RAMpage work, 
though comparable to one of the speeds in older work [16], to take into account 
the slower speeds of low-energy designs. 

A major difference from previously reported results, which used Direct Rambus,
is the use of double-data-rate synchronous DRAM. The DDR-SDRAM modelled [17]
draws an average of 200mA in normal operation and 2mA in self-refresh mode,
both at 2.6V. In self-refresh mode, the external clock is turned off, and the
contents of DRAM are maintained without external intervention. Actual DRAM 
power usage varies according to the reference pattern, but for this preliminary 
work, an average value is used, and the same value is used for entry to and exit 
from self-refresh mode. In previous work, detail of the DRAM was not considered 
important, as fixing DRAM speed while speeding up the CPU represented the
increasing CPU-DRAM speed gap. In this paper, DRAM detail is more important 
because power usage is timing-dependent. 
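
The current and voltage figures above fix the two power levels used in the rest of the energy accounting. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative and do not come from the simulator.

```python
# Current and voltage figures quoted in the text for the modelled DDR-SDRAM.
SUPPLY_VOLTS = 2.6
ACTIVE_AMPS = 0.200        # average draw in normal operation
SELF_REFRESH_AMPS = 0.002  # draw in self-refresh mode


def power_watts(amps: float, volts: float = SUPPLY_VOLTS) -> float:
    """Return electrical power P = I * V in watts."""
    return amps * volts


active_w = power_watts(ACTIVE_AMPS)
self_refresh_w = power_watts(SELF_REFRESH_AMPS)
power_ratio = active_w / self_refresh_w

print(f"full power:   {active_w:.2f} W")
print(f"self-refresh: {self_refresh_w * 1000:.1f} mW")
print(f"ratio:        {power_ratio:.0f}x")
```

Under these figures full-power DRAM draws about 0.52W, roughly a hundred times the self-refresh level.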

The following parameters are similar to previous simulations except as noted, 
and are common across RAMpage and the conventional hierarchy: 

- L1 cache: 16 Kbytes each of data and instruction cache, physically tagged 
and indexed, direct-mapped, 32-byte blocks, 1-cycle read hit time, 12-cycle 
penalty for misses to L2 (or RAMpage SRAM main memory) 

- TLB: 64 entries, fully associative, random replacement, 1-cycle hit time, 
misses modelled by interleaving a trace of page look-up software 

- DRAM: DDR400 SDRAM, 40ns before first reference starts, 64-bit 5ns 
bus (data moves every 2.5ns: transfer rate approximately 0.3ns per byte); 
DRAM time to exit self-refresh is 1µs, and time to enter self-refresh mode 
is 20ns 

- paging of DRAM: inverted page table, same organization as RAMpage main 
memory for simplicity; the workload is preloaded, so there are no page faults 
to disk; for energy calculations, a 128MB DRAM is assumed 

- TLB and L1 data hits are fully pipelined: they do not add to execution time; 
only instruction fetch hits add to simulated run time; time for replacements 
or maintaining inclusion is costed as L1d or TLB “hits” 
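
The DRAM figures in the list above determine how long one block or page transfer takes. The sketch below works through that arithmetic under a simple assumption that is not stated in the paper: a miss costs the initial 40ns, then one 8-byte beat every 2.5ns, plus the 1µs wake-up when DRAM has to leave self-refresh. All names are hypothetical.

```python
FIRST_REFERENCE_NS = 40.0  # delay before the first reference starts
BUS_BYTES = 8              # 64-bit bus
BEAT_NS = 2.5              # data moves every 2.5 ns (double data rate)
WAKE_NS = 1000.0           # exit from self-refresh takes 1 microsecond
CPU_CYCLE_NS = 2.0         # 500 MHz processor


def transfer_ns(block_bytes: int, dreamy: bool = False) -> float:
    """Estimated time to fetch one block or page from DRAM, in ns."""
    beats = block_bytes / BUS_BYTES
    total = FIRST_REFERENCE_NS + beats * BEAT_NS
    if dreamy:
        total += WAKE_NS  # assumption: every dreamy access pays a full wake-up
    return total


for size in (128, 256, 512, 1024, 2048, 4096):
    print(f"{size:5d} bytes: {transfer_ns(size):7.1f} ns awake, "
          f"{transfer_ns(size, dreamy=True):7.1f} ns from self-refresh")

print(f"per-byte rate: {BEAT_NS / BUS_BYTES} ns")
print(f"wake-up cost:  {WAKE_NS / CPU_CYCLE_NS:.0f} CPU cycles")
```

The per-byte rate comes out at 0.3125ns, matching the approximate 0.3ns per byte quoted above, and the wake-up time is equivalent to 500 cycles of the 2ns processor clock.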

A context switch (modelled by interleaving a trace of text-book code) is 
generally taken every 500,000 references, though RAMpage with context switches 
on misses also switches processes on a miss to DRAM. TLB misses are handled 
by inserting a trace of page table lookup code, with variations on time for a 
lookup based on probable variations in probes into an inverted page table [16]. 



Specific to conventional hierarchy. L2 cache is 2-way associative, 1 Mbyte. 
The bus connecting L2 to the CPU is 128 bits wide and runs at one third of the 
CPU issue rate (6 ns versus the CPU’s 2 ns). The miss penalty from L1 to L2 
overall is 12 CPU cycles. Inclusion between L1 and L2 is maintained [7], so L1 is 
always a subset of L2, except that some blocks in L1 may be dirty with respect 
to L2 (writebacks occur on replacement). 

The TLB caches translations from virtual pages to DRAM physical frames. 



Specific to RAMpage hierarchy. The TLB maps the SRAM main memory. 
Full associativity is implemented by a software miss handler. The operating 
system takes up 9 SRAM main memory pages when simulating a 4 Kbyte SRAM 
page (36 Kbytes), up to 752 pages for a 128-byte block size (94 Kbytes). 
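
As a quick check on the reserved sizes quoted above, the sketch below converts page counts to Kbytes and to a share of the 1 Mbyte SRAM main memory. The function names are illustrative only.

```python
SRAM_BYTES = 1024 * 1024  # 1 Mbyte SRAM main memory


def reserved_kbytes(page_bytes: int, pages: int) -> float:
    """Kbytes of SRAM pinned for the operating system."""
    return page_bytes * pages / 1024


def reserved_share(page_bytes: int, pages: int) -> float:
    """Fraction of the SRAM main memory pinned for the operating system."""
    return page_bytes * pages / SRAM_BYTES


for page_bytes, pages in ((4096, 9), (128, 752)):
    print(f"{page_bytes:4d}-byte pages: {reserved_kbytes(page_bytes, pages):.0f} Kbytes "
          f"({reserved_share(page_bytes, pages):.1%} of SRAM)")
```

The reserved share grows from about 3.5% at 4 Kbyte pages to about 9.2% at 128 bytes, which is one reason small page sizes leave less SRAM for the workload.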

The SRAM main memory uses an inverted page table. TLB misses do not 
reference DRAM, if the original reference can be found in an SRAM level. 



Inputs and variations. Traces used are from the Tracebase trace archive 
at New Mexico State University (see footnote 1). Although these traces are from
the obsolete SPEC92 benchmarks, they are sufficient to warm up the size of cache
used here, because 1.1 billion references are used, with traces interleaved to
create the effect of a multiprogramming workload. 

To measure variations on energy use, the size of the SRAM main memory 
page (or L2 block size in the conventional model) was varied from 128 bytes to 
4 Kbytes, and the simulation was instrumented to track energy use. 

In dreamy mode, it was assumed that if a DRAM access started before the 
previous one had completed, DRAM would still be awake. Otherwise, once a 
DRAM reference completed, it was put into self-refresh mode. For comparison, 
simulations were run with DRAM permanently in full power mode. The simu- 
lator allows for a lag after references before entering self-refresh mode, but this 
option is still to be explored. 
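
The dreamy-mode rule described above can be sketched as a small accounting loop. This is only an illustration under stated assumptions, not the simulator's code: references are given as hypothetical (start, duration) pairs in nanoseconds, an access that arrives while DRAM is busy queues behind the current one without a wake-up, wake-up time is counted as awake time, and the 20ns entry time is ignored.

```python
WAKE_NS = 1000.0  # time to exit self-refresh, from the DRAM parameters


def awake_time_ns(references):
    """Total time DRAM spends awake for (start_ns, duration_ns) references.

    References must be sorted by start time.
    """
    awake = 0.0
    busy_until = None  # time at which the DRAM finishes its current work
    for start, duration in references:
        if busy_until is not None and start < busy_until:
            # Previous access still in progress: DRAM is awake, so this
            # access queues behind it without a wake-up penalty.
            awake += duration
            busy_until += duration
        else:
            # DRAM is in self-refresh: pay the wake-up time, then transfer.
            awake += WAKE_NS + duration
            busy_until = start + WAKE_NS + duration
    return awake


# Hypothetical reference stream: two overlapping accesses, then an isolated one.
refs = [(0.0, 80.0), (500.0, 80.0), (5000.0, 80.0)]
total_awake = awake_time_ns(refs)
print(f"DRAM awake for {total_awake:.0f} ns")
```

In this example the second access overlaps the first and avoids a wake-up, while the third finds DRAM asleep and pays the full 1µs again.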

Total energy was calculated by multiplying time in each mode by the power 
of that mode. 
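
The energy calculation just described can be written as a short sketch. The power levels follow from the DRAM parameters (200mA and 2mA at 2.6V); the function name is an assumption, and the sample times are taken from the RAMpage 4KB row of Table 1.

```python
FULL_POWER_W = 0.52      # 200 mA at 2.6 V
SELF_REFRESH_W = 0.0052  # 2 mA at 2.6 V


def dram_energy_joules(asleep_s: float, awake_s: float) -> float:
    """Energy = time in each mode multiplied by the power of that mode."""
    return asleep_s * SELF_REFRESH_W + awake_s * FULL_POWER_W


# RAMpage without context switches on misses, 4KB pages (times from Table 1).
dreamy_j = dram_energy_joules(asleep_s=2.304, awake_s=0.334)
# The same configuration with DRAM permanently at full power.
always_on_j = dram_energy_joules(asleep_s=0.0, awake_s=2.487)

print(f"dreamy:    {dreamy_j:.3f} J")
print(f"always on: {always_on_j:.3f} J")
```

Both values agree with the corresponding Table 2 entries to three decimal places.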

3.4 Expected Findings 

It was expected that speed differences would not be the most significant finding, 
given that early studies [16, 13] showed little difference between RAMpage and 
the conventional model at the clock speed being modelled in this paper. It was 
expected that the introduction of dreamy mode would have less of an effect on 
RAMpage than on the conventional model, given that RAMpage has been shown 
to be more tolerant of an increased DRAM latency, especially when context 
switches are taken on misses [14]. 

Footnote 1: See ftp://tracebase.nmsu.edu/pub/traces/uni/r2000/utilities/ and 
ftp://tracebase.nmsu.edu/pub/traces/uni/r2000/SPEC92/. 

With a significantly smaller SRAM main memory than in earlier experiments, 
it was expected that RAMpage, which pins parts of the operating system in 
the SRAM main memory, would be less competitive on speed than in earlier 
experiments even in dreamy mode, where the increased effective DRAM latency 
would make this experiment closer to earlier ones with faster processors. 

RAMpage, however, has the potential to show a better overall combination 
of lower speed loss and lower overall energy use in dreamy mode than the
conventional hierarchy. Since RAMpage in general (and more specifically when
context switches are taken on misses) spends a lower fraction of its time
waiting for DRAM, it is likely that it will need DRAM to be in full-power mode
less often than the standard hierarchy does. 

4 Results 

4.1 Introduction 

This section presents results of simulations, with some discussion of their signif- 
icance. The main focus here is on comparing the effects of varying the memory 
hierarchy on energy and power use. 

Figure 1 shows an overall comparison of all the variations measured. Of most 
interest is the fact that it’s hard to tell apart speed variations of the best cases 
for each configuration on the same scale, whereas energy variations for dreamy 
and non-dreamy cases are clearly separated. This observation illustrates that 
aiming to save energy while minimizing performance loss is achievable. 

The remainder of this section presents more detail of results. Speed variations 
are followed by energy variations. Finally, design trade-offs are considered. 

4.2 Speed Variations 

Speed variations are shown in Table 1. For the non-dreamy case, each speedup
is that of the best measured time over the time in question. For dreamy times, 
speedups are given both relative to the non-dreamy run with the same parameters
(2nd-last column) and relative to the best non-dreamy time (last column). 
The best dreamy and non-dreamy times are highlighted. 

The best dreamy run time is for RAMpage with context switches on misses, 
with a 2KB SRAM page size. Execution time here is 9% slower than for the 
best non-dreamy case (conventional hierarchy, 512B L2 block size). More speed 
variation is accounted for by variations in the SRAM page or L2 block size than 
by using or not using dreamy mode. The slowest dreamy simulated execution 
time is 5.06s; the slowest non-dreamy time is 4.92s, a difference of under 3%. 

The standard hierarchy’s best dreamy time (1KB L2 block size) is 15% slower 
than the best non-dreamy time, while the best dreamy RAMpage time without 
context switches on misses is 12% slower than the best non-dreamy time. 

Fig. 1. Comparison of speed and energy usage. In all figures, “CX” means with context 
switches on misses. 
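
The percentages quoted in this section follow directly from the run times in Table 1. A minimal sketch of that calculation is given here; the dictionary labels and function name are illustrative.

```python
BEST_NON_DREAMY_S = 2.359  # conventional hierarchy, 512B L2 blocks

# Best dreamy run time for each hierarchy, from Table 1.
best_dreamy_s = {
    "RAMpage CX, 2KB pages": 2.572,
    "RAMpage, 4KB pages": 2.638,
    "standard, 1KB blocks": 2.705,
}


def percent_slower(time_s: float, baseline_s: float = BEST_NON_DREAMY_S) -> float:
    """How much slower a run is than the baseline, as a percentage."""
    return (time_s / baseline_s - 1.0) * 100.0


for label, seconds in best_dreamy_s.items():
    print(f"{label}: {percent_slower(seconds):.1f}% slower than best non-dreamy")

# Slowest dreamy time against slowest non-dreamy time.
print(f"slowest cases differ by {percent_slower(5.062, 4.919):.1f}%")
```

Rounded to whole numbers these give the 9%, 12% and 15% figures quoted in the text, with the two slowest cases differing by just under 3%.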



Table 1. Speed variations. For each size, rows show the standard hierarchy,
RAMpage without context switches on misses, and RAMpage with context switches
on misses (CX). The best non-dreamy and dreamy times are marked *.

L2 block/              Non-dreamy          Dreamy times (s)         Dreamy speedups
page size  Hierarchy   time (s)  best      asleep  awake   total    non-dreamy  best
(bytes)                          speedup
128        standard    2.395     1.015     2.224   1.378   3.602    1.504       1.526
           RAMpage     3.821     1.619     3.654   1.409   5.062    1.325       2.145
           RAMpage CX  4.919     2.085     3.127   1.124   4.251    0.864       1.802
256        standard    2.362     1.001     2.220   0.845   3.065    1.298       1.299
           RAMpage     3.043     1.290     2.910   0.856   3.766    1.238       1.596
           RAMpage CX  3.638     1.542     2.602   0.698   3.301    0.907       1.399
512        standard    2.359*    1.000     2.220   0.584   2.804    1.188       1.188
           RAMpage     2.663     1.129     2.534   0.577   3.111    1.168       1.318
           RAMpage CX  2.977     1.262     2.328   0.466   2.794    0.938       1.184
1024       standard    2.386     1.011     2.224   0.481   2.705    1.133       1.146
           RAMpage     2.510     1.064     2.368   0.439   2.807    1.118       1.190
           RAMpage CX  2.647     1.122     2.229   0.406   2.635    0.996       1.117
2048       standard    2.483     1.052     2.237   0.540   2.777    1.119       1.177
           RAMpage     2.454     1.040     2.304   0.354   2.658    1.083       1.127
           RAMpage CX  2.506     1.062     2.201   0.371   2.572*   1.026       1.090
4096       standard    2.879     1.220     2.293   1.077   3.371    1.171       1.429
           RAMpage     2.487     1.054     2.304   0.334   2.638    1.061       1.118
           RAMpage CX  2.562     1.086     2.139   0.538   2.677    1.045       1.135

While RAMpage does not do well with small SRAM page sizes, as reported 
in earlier work [16], time variations for cases with reasonable SRAM page sizes 
are low considering the relatively large energy saving of dreamy mode. RAMpage 
with and without context switches on misses does not differ as significantly as 
in earlier studies with a large CPU-DRAM speed gap and large L2 [14]. Dreamy 
mode does increase the effective CPU-DRAM speed gap: an extra miss penalty of 
500 clock cycles is similar to increasing the processor speed to the speeds
previously modelled. However, as expected, the smaller SRAM main memory used
in this study disadvantages RAMpage more than the conventional model. 

More data is needed to understand why RAMpage with context switches on 
misses is faster in some cases in dreamy mode. A possible explanation is that 
dreamy mode, with its longer latency for DRAM accesses, loses less performance 
to context switches before a working set has loaded fully into SRAM. 



4.3 Energy Variations 

Table 2 shows the simulated DRAM energy usage for each variation. 

The lowest-energy non-dreamy case is the standard hierarchy with a 512B L2 
block size, which also has the quickest execution time. For dreamy runs, however, 
the lowest-energy case is RAMpage without context switches on misses for a 4KB 
SRAM page size, as compared with the fastest case: 2KB page size, with context 
switches on misses. This discrepancy results from the fact that in 
the 4KB case, DRAM is awake for a smaller total time (lower “awake” energy). 
The time DRAM is awake depends on the time that transfers take. A larger page 
size may reduce the miss rate, but total transfer time may increase. Taking
context switches on misses hides this effect on run time by doing other work on
a miss. However, increased energy is not disguised by overlapping transfers
with other work. 

Table 2. DRAM energy use. For each size, rows show the standard hierarchy,
RAMpage without context switches on misses, and RAMpage with context switches
on misses (CX).

L2 block/              Non-dreamy          Dreamy energy
page size  Hierarchy   energy (J)  x best  asleep (J)  awake (J)  total (J)  % full  x best
(bytes)
128        standard    1.241       6.7     0.0116      0.716      0.728      58.7    3.92
           RAMpage     1.987       10.7    0.0190      0.732      0.751      36.9    4.05
           RAMpage CX  2.558       13.8    0.0163      0.584      0.601      25.8    3.24
256        standard    1.223       6.6     0.0115      0.440      0.451      22.9    2.43
           RAMpage     1.582       8.5     0.0151      0.445      0.460      39.2    2.48
           RAMpage CX  1.892       10.2    0.0135      0.363      0.377      21.3    2.03
512        standard    1.220       6.6     0.0115      0.304      0.315      39.2    1.70
           RAMpage     1.385       7.5     0.0132      0.300      0.313      21.3    1.69
           RAMpage CX  1.548       8.3     0.0121      0.242      0.255      22.9    1.37
1024       standard    1.232       6.6     0.0116      0.250      0.262      21.3    1.41
           RAMpage     1.305       7.0     0.0123      0.228      0.241      22.9    1.30
           RAMpage CX  1.376       7.4     0.0116      0.211      0.223      39.2    1.20
2048       standard    1.276       6.9     0.0116      0.281      0.293      22.9    1.58
           RAMpage     1.276       6.9     0.0120      0.184      0.196      20.1    1.06
           RAMpage CX  1.303       7.0     0.0114      0.193      0.204      44.2    1.10
4096       standard    1.458       7.9     0.0119      0.560      0.572      39.2    3.08
           RAMpage     1.293       7.0     0.0120      0.174      0.186      14.4    1.00
           RAMpage CX  1.332       7.2     0.0111      0.280      0.291      21.9    1.57

Fig. 2. All vs. Standard dreamy energy usage. 

Figure 2 compares energy use in dreamy mode for all variations with a break- 
down of energy use by the standard architecture. Energy use increases signifi- 
cantly for large cache block sizes in the standard architecture, which is less true 
of RAMpage variations. The reason for this behaviour of the standard model 
can be seen in Figure 2. As L2 block size increases, energy use while asleep de- 
creases, but awake energy increases, corresponding to a larger fraction of time 
being spent waiting for DRAM (as confirmed by the increase in execution time 
for the standard dreamy simulations for larger L2 block sizes, in Figure 1(a)). 

Figure 3 compares RAMpage variations. The increase in energy use for con- 
text switches on misses with a 4KB page size needs further investigation. A likely 
cause is increased contention for SRAM pages resulting from the higher context 
switch rate. Large pages (or cache blocks) are likely to have the highest benefit 
if their prefetch effect can be put to good use. 



4.4 Overall Trade-Offs 

In summary, the fastest time does not necessarily correspond to lowest energy 
use, even if the system is not operating for as long overall. The quickest dreamy 
run time was 2.57s (context switches on misses, 2KB SRAM page size), while the 
lowest-energy variant took 2.64s (4KB SRAM page size, no context switches on 
misses). The lowest-energy variant ran about 3% slower than the fastest dreamy 
variant, or 12% slower than the fastest non-dreamy variant. 

Fig. 3. Comparison of RAMpage dreamy energy breakdown. 



Electing to run RAMpage in its fastest dreamy mode would require 10% 
more energy for DRAM than its most energy-efficient variant. However, the
overall fastest version (conventional, 512B L2 blocks) needs 6.6 times the
energy of the most energy-efficient version, or 6 times the energy of the
fastest dreamy variation. 

A designer therefore can balance choices between maximum speed (no dreamy 
mode, standard two-level cache) and maximum energy saving (RAMpage with- 
out context switches on misses, SRAM page size chosen for lowest energy). As 
a compromise, it would be possible to use RAMpage with context switches on 
misses, with sub-optimal energy use, but better performance. 

These speed-energy trade-offs only represent DRAM energy. The CPU and 
SRAM modelled use 1W (0.2W and 0.8W, respectively). For a run time of 2.57s, 
the pair uses 2.57J whereas over the best run time of 2.36s, the total goes 
to 2.36J. This difference is easily justified by saving over 1J in DRAM energy 
but, nonetheless, a more comprehensive energy analysis of the whole system is 
needed. For example, PADRAM uses more energy in its equivalent of a 
simple dreamy mode than without [11], probably because of its relatively small 
L2 cache. The relatively large L2 used here, on the other hand, uses more energy 
than one would like for a low-energy design. 

5 Conclusion 

5.1 Introduction 

This paper has presented an initial study of use of the RAMpage memory hi- 
erarchy to reduce DRAM energy usage. The approach used was to simulate a 
dreamy memory, in which DRAM is turned off except when referenced. The
motivation for this study is previous results, which have shown RAMpage to be 
more tolerant of increased DRAM latency than a conventional hierarchy. 

The remainder of this section summarizes results, outlines future work and 
presents overall conclusions. 



5.2 Summary of Results 

RAMpage, with the option of context switches on misses, presents some useful 
options in choosing an energy-speed design trade-off. Assuming a relatively 
low-energy processor design (as well as low-energy components for the remainder 
of the system), dreamy energy savings could be significant. The fastest config- 
uration uses almost 7 times the energy of the most energy-efficient one, for a 
performance gain of only 12%. The performance cost of dreamy mode can be 
brought down to 9% by a relatively modest compromise on energy saving: this 
dreamy configuration still uses a sixth of the energy of the fastest version. 

The best overall compromise is achieved by RAMpage with context switches 
on misses, though by a less significant margin than in earlier studies, which 
showed this variant to be most tolerant of high DRAM latencies [14] . A relatively 
small SRAM layer makes RAMpage less competitive than in these earlier studies. 



5.3 Future Work 

It is important to note that, although the savings were achieved with modest
speed loss, overall energy usage should take into account other parts of the
system, which would use more energy if left in full power mode for a longer
time. If energy for the processor and L2 is added in, the fastest dreamy
version also has the lowest overall energy by a small margin (2.78J versus
2.82J for the version with lowest DRAM energy). This should be compared against
3.58J for the best non-dreamy version, which uses 29% more energy: a useful
difference, but not as dramatic as a factor of 6. 
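
The whole-system totals quoted above can be reproduced from the run times in Table 1 and the DRAM energies in Table 2, using the 1W figure given earlier for the CPU and SRAM together. The sketch below assumes, as the text does, that CPU and SRAM draw constant power for the whole run; the names are illustrative.

```python
CPU_AND_SRAM_W = 1.0  # 0.2 W processor plus 0.8 W SRAM


def system_energy_j(run_time_s: float, dram_energy_j: float) -> float:
    """CPU and SRAM energy over the run, plus DRAM energy."""
    return CPU_AND_SRAM_W * run_time_s + dram_energy_j


# (run time in seconds, DRAM energy in joules) from Tables 1 and 2.
configs = {
    "fastest dreamy (RAMpage CX, 2KB)": (2.572, 0.204),
    "lowest DRAM energy (RAMpage, 4KB)": (2.638, 0.186),
    "best non-dreamy (standard, 512B)": (2.359, 1.220),
}

totals = {label: system_energy_j(t, e) for label, (t, e) in configs.items()}
for label, joules in totals.items():
    print(f"{label}: {joules:.2f} J")
```

Once CPU and SRAM energy are included, the ordering of the two dreamy variants reverses, because the faster run keeps the 1W of CPU and SRAM powered for less time.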

Results should be extended to a more detailed analysis of overall system 
energy, including low-energy variations on caches, and low-energy versions of 
faster processors. The simulated SRAM has a relatively low latency for waking 
up from low-power mode (50% of the latency of an L2 hit). Since 1MB SRAM 
(0.8W) uses more power than DRAM in full-power mode (0.52W) - as simulated 
here - this would be a useful variation to explore. 

A RAMpage implementation on the L4 Pistachio kernel [5] is planned. This 
kernel is small enough to permit implementation of its minimum memory- 
resident data and code in the 64KB static RAM memory in the ARM11 family. 
Using this extra SRAM would also make it viable to implement RAMpage with 
a smaller SRAM main memory, a significant factor in the overall energy budget 
of this kind of system. L4 has been ported to the M5 architecture simulator [3] 
by the NICTA group at University of New South Wales, creating the possibility 
of RAMpage on a full-system simulator, a goal of earlier work. 

The existing simulator will be used to experiment with further variations 
on energy-efficient memories. For example, instead of an SRAM main memory, 
main memory could be implemented in a small fast permanently powered up 
DRAM, with the remaining DRAM operating as a dreamy paging device. 

5.4 Overall Conclusion 

In this latest study, investigating a dreamy memory model has shown the potential 
for RAMpage in low-energy designs. While RAMpage did not run in the shortest 
time in full-power mode (expected with a relatively slow processor), it did have 
both the fastest and lowest-energy measurements in dreamy mode. 

Results showed a fair fraction of the potential energy gain: full power mode 
needed 100 times the power of self-refresh mode; the lowest-energy case used
less than a sixth of full DRAM power. The aim of achieving close to an average
of self-refresh power use with as close as possible to full speed has been
partially met. 
More sophisticated approaches to minimizing power use (e.g., keeping power on 
for a period after a reference, exploiting a wider range of low-power modes, and 
dynamic page placement policies, as in PAVM) could further reduce energy use. 

The design trade-offs discussed here represent a starting point: overall low- 
energy system design requires design of the whole system to minimise energy 
use. Just as Amdahl’s Law shows that focus on one area of speed improvement 
has diminishing returns, we need to be careful not to interpret energy savings in 
isolation. Nonetheless RAMpage shows promise in the area of low-energy design, 
and this study will be followed up with others. 

Acknowledgements. Financial support for this work has been received from 
the University of Queensland. I would like to thank Gernot Heiser for proposing 
that I investigate energy management using RAMpage. 

Abstract. Traditional dynamic voltage scaling algorithms periodically 
monitor CPU utilization and adapt the CPU’s operating frequency to the
upcoming performance requirement for CPU power management. The upcoming
performance requirement is usually estimated by predicting CPU utilization.
In order for dynamic voltage scaling algorithms to be effective, the
prediction accuracy of CPU utilization must be high. This paper proposes a
power management algorithm that improves the accuracy of predicting future
CPU utilization using process state information. Experiments show that the
proposed algorithm reduces power consumption by 11%-57% without any
performance degradation. 



1 Introduction 

In mobile battery-powered systems, power is a precious resource, and a CPU is
known to consume more than 50% of the whole system power [1]. 
Therefore, efficient CPU power management is required to reduce the power 
consumption of a system. In CPUs based on CMOS logic, the peak frequency 
is proportional to the supply voltage and power is proportional to the square 
of the supply voltage. Dynamic voltage scaling (DVS) has been implemented in 
most CPUs in order to control power consumption by dynamically changing the
CPU’s operating frequency. 

Dynamic power management through DVS is classified into two approaches: 
an intra-task approach and an inter-task approach, according to the location of 
the power management algorithm (i.e., inside of a task or outside of a task). In 
intra-task approaches, a compiler or software tool analyzes a task and determines 
when the CPU frequency has to be changed [2]. This approach can adjust the CPU 
frequency with considerable accuracy because the performance requirement of a 
task is analyzed in advance. However, this approach is impractical because all 
applications have to be modified. The inter-task approach is divided into three
approaches based on the type of information used for power management. The first 
approach uses the deadlines of tasks [3,4]. It achieves power reduction by
considering the worst case execution time of a task and uses that information
to exploit the slack time generated by the scheduler. However, it requires the
task deadline information, which reduces its applicability. The second approach
uses the tasks’ characteristics obtained from the analysis of events such as
system calls and task creation/exit/switch [5]. This approach allows different
power management algorithms to be applied to each task according to its
characteristics. However, the overhead of event monitoring and analysis may be
significant. For example, monitoring system calls and the kernel scheduler on a
Transmeta Crusoe CPU costs about 1%-4% of the total CPU cycles [5]. The third 
approach periodically monitors the CPU utilization, and uses that information 
to predict the expected CPU utilization [6,7]. The CPU frequency is changed 
adaptively to the predicted CPU utilization. Because this operation must be
repeated periodically, it is called an interval-based approach. This approach
provides higher applicability than other approaches while achieving simplicity.
However, this approach could suffer from inefficiency due to its inaccurate
prediction. We consider the interval-based approach promising because of its
simplicity. 

A careful investigation of existing interval-based approaches leads us to
identify the reason for their inaccurate prediction. In Figure 1, we assume
that there are only three processes P1, P2 and P3 and their utilizations are
0.2, 0.3, and 0.3 respectively, on each run. At every interval, the CPU
utilization for the next interval is computed as the exponential average (EXP)
of the previous and current CPU utilizations [8]. As shown in Figure 1(a),
there exists a large gap between the actual CPU utilization and the CPU
utilization predicted by the EXP. We think this is mainly caused by the fact
that EXP is computed over all processes regardless of their state. In Figure
1(b), after the fourth interval, P1 and P2 exist in the ready queue (RQ). EXP
predicts that P1’s utilization is 0.2 and P2’s utilization is 0.3, and computes
the CPU utilization of the fifth interval as 0.5. Thus, considering the states
of processes in the computation of EXP improves the prediction accuracy of the
CPU utilization. Table 1 shows that there is a very low probability of a
process remaining in the ready-to-run state for two consecutive intervals or
more. 
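
The contrast between the two prediction styles can be illustrated with a short sketch. The exponential-average weight of 0.5 and the system-wide utilization history are assumptions made for this example only; the per-process values (0.2, 0.3, 0.3) and the ready-queue contents (P1 and P2) are those of the Figure 1 discussion.

```python
def exp_average(previous_estimate: float, observed: float, weight: float = 0.5) -> float:
    """Exponential average: blend the newest observation with the old estimate."""
    return weight * observed + (1.0 - weight) * previous_estimate


# (a) System-wide prediction over hypothetical total utilizations per interval.
observed_totals = [0.5, 0.3, 0.8, 0.3]  # illustrative values only
system_estimate = observed_totals[0]
for total in observed_totals[1:]:
    system_estimate = exp_average(system_estimate, total)

# (b) Per-process prediction: sum the estimates of processes in the ready queue.
per_process_estimate = {"P1": 0.2, "P2": 0.3, "P3": 0.3}
ready_queue = ["P1", "P2"]
state_aware_estimate = sum(per_process_estimate[name] for name in ready_queue)

print(f"system-wide EXP estimate: {system_estimate:.2f}")
print(f"ready-queue estimate:     {state_aware_estimate:.2f}")
```

The system-wide estimate depends on whatever mix of processes happened to run in recent intervals, whereas the ready-queue estimate reflects only the processes that are about to run.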



Table 1. Probability of a process remaining in the ready-to-run state for two
consecutive intervals or more: for example, the value 0.52 implies that about
half of the ready-to-run processes in the current interval will remain in the
ready-to-run state until the next interval 

Application type                 Fraction
interactive (xpdf)               0.52
multimedia (mplayer)             0.34
I/O intensive (ubench/fsdisk)    0.93 (the fraction of cycles in which CPU is idle is 0.53)

Fig. 1. Predicting CPU utilization. (a) Based on CPU utilization of all
processes executed in the past intervals. (b) Based on CPU utilization of
processes in the ready queue (RQ). Horizontal axis in both panels: interval. 



This paper proposes a dynamic power management algorithm that improves 
the prediction accuracy of existing interval-based power management algorithms 
using process information. The remainder of this paper is organized as follows. 
Section 2 describes the proposed algorithm. Section 3 evaluates the amount 
of power consumption of the proposed algorithm and existing interval-based 
algorithms under various types of workloads. Finally, the conclusion is presented 
in Section 4. 



2 The Proposed Algorithm 

In this section, we propose an algorithm that improves the efficiency of
dynamic power management by considering only the processes that are likely going 




dDVS: An Efficient Dynamic Voltage Scaling Algorithm 163 



Table 2. Symbols and their meanings 

Symbol                 Meaning
$Q$                    the length of time quantum
$I$                    the length of an interval
PUP                    per-Process Utilization Predictors such as PAST, EXP and PD
$f_{min}$              the minimum CPU frequency
$e_i$ / $e_i^*$        the actual/estimated execution time of process i in the current interval
$U_i$ / $U_i^*$        actual/estimated utilization of process i in the current interval
$U_i^-$ / $U_i^{*-}$   actual/estimated utilization of process i in the previous interval
$f$ / $f^*$            CPU frequency in current/next interval
$P$                    a set of processes executed in the current interval
$P^*$                  a set of processes to be executed with a high probability in the next interval



Input: $P$, $f$    Output: $f^*$

1. Update the execution time and utilization of all processes executed in the current interval. For each $p_i \in P$:

   $e_i = e_i \times \frac{f}{f_{max}}$    // for normalizing utilization by $f_{max}$ //

   $u_i = \frac{e_i}{I}$

2. Estimate CPU utilization in the next interval:

   $U^* = \sum_{p_i \in P^*} PUP(p_i)$,

   where $P^* = \{ p_k \mid p_k$ is a process that is going to be executed in the next interval $\}$ and $|P^*| \le \lceil I/Q \rceil$

3. Compute CPU frequency for the next interval:

   $f^* = U^* \times f_{max}$

Fig. 2. Description of the proposed algorithm



to be executed in the next interval. This algorithm periodically keeps track of the utilization of each process executed in previous intervals, finds the processes to be executed in the next interval, and predicts the future CPU utilization as the sum of their utilizations, each estimated by an existing interval-based approach. Thus, the proposed algorithm requires process information, such as the process’s state, its scheduling priority and its past utilizations. Table 2 describes the notation used in the rest of this section.

Figure 2 describes the proposed algorithm. First, it updates the utilization of all processes executed in the current interval. The utilization of process i is equal to $e_i / I$. The execution time of each process should be normalized by the maximum frequency $f_{max}$ because the measured execution time depends on the CPU frequency at which the process ran. For example, when the maximum frequency is 600 MHz and a process was executed at 300 MHz for 10 ms, the normalized execution time of the process is 5 ms. Next, the proposed algorithm selects the processes that are highly likely to be executed. A process is in one of three states: READY (waiting to run), WAIT (waiting for I/O completion or a signal), and RUN (executing on a CPU). The kernel’s scheduler chooses the process with the highest scheduling priority from the set of all processes in the READY state. Thus, the next process to be executed can be predicted by checking the priorities of the processes before they execute. Therefore, the proposed algorithm chooses the $\lceil I/Q \rceil$ processes with the highest scheduling priority from the set of processes in the READY state. The parameter $\lceil I/Q \rceil$ is chosen empirically. Next, it estimates the utilization of each chosen process. To predict it, the algorithm uses a per-Process Utilization Predictor (PUP). A PUP predicts the utilization of a process in the next interval from its past utilizations. Any interval-based approach can be used as a PUP. Next, the proposed algorithm estimates the CPU utilization as the sum of the utilizations of the chosen processes estimated by the PUP. Finally, it computes the CPU frequency for the next interval. Because the CPU utilization from the previous step is normalized to the maximum frequency, the CPU frequency for the next interval is the product of the estimated CPU utilization and the maximum frequency, as in step 3 of Figure 2.

The proposed algorithm applies existing interval-based approaches to each process that is going to be executed in the next interval instead of to all processes executed in the past intervals. Because it only considers processes in the READY state, the creation, exit and state changes of a process affect its prediction. Thus, in the cases where the state of a process often varies or its lifetime is short, this algorithm largely reduces the power consumption compared to existing interval-based approaches. By contrast, in the case of a process that seldom changes its state, this algorithm and existing interval-based approaches show similar performance.
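The three steps of Figure 2 can be sketched in Python as follows. This is a minimal illustration rather than the authors’ kernel code: the `Process` record, the function names, and the convention that a larger `priority` value means a higher scheduling priority are assumptions made for the example, and the limit of $\lceil I/Q \rceil$ selected processes follows the reading of Figure 2 given above.

```python
import math
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    state: str            # "READY", "WAIT" or "RUN"
    priority: int         # assumption: larger value means higher priority
    exec_time_ms: float   # time the process ran in the current interval
    util: float = 0.0     # normalized utilization u_i


def normalize_exec_time(measured_ms, freq_mhz, f_max_mhz):
    """Scale a time measured at freq_mhz to the equivalent time at f_max."""
    return measured_ms * freq_mhz / f_max_mhz


def next_frequency(processes, freq_mhz, f_max_mhz, interval_ms, quantum_ms, pup):
    """Return the frequency (MHz) for the next interval.

    pup is any per-process utilization predictor: a function that takes a
    Process and returns its estimated utilization for the next interval.
    """
    # Step 1: update the normalized utilization of every process.
    for p in processes:
        normalized = normalize_exec_time(p.exec_time_ms, freq_mhz, f_max_mhz)
        p.util = normalized / interval_ms

    # Step 2: select at most ceil(I/Q) READY processes by priority and
    # sum their predicted utilizations.
    limit = math.ceil(interval_ms / quantum_ms)
    ready = [p for p in processes if p.state == "READY"]
    chosen = sorted(ready, key=lambda p: p.priority, reverse=True)[:limit]
    predicted_util = sum(pup(p) for p in chosen)

    # Step 3: scale the predicted utilization by the maximum frequency.
    return predicted_util * f_max_mhz
```

With the numbers from the text, `normalize_exec_time(10.0, 300.0, 600.0)` gives 5 ms. The sketch returns a continuous frequency; mapping it onto the discrete frequency levels of a real CPU is left out.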

3 Performance Evaluation 

3.1 Experimental Environment 

The proposed algorithm has been evaluated on a Sony VAIO-C1VJ notebook equipped with Transmeta’s Crusoe CPU. The CPU provides four pairs of frequency and voltage levels: (300 MHz, 1.3 V), (400 MHz, 1.35 V), (500 MHz, 1.4 V), and (600 MHz, 1.6 V). Thus, the maximum CPU frequency is 600 MHz ($f_{max}$ = 600). To set the frequency and voltage, the TM5600 stays in the sleep mode for 20 μsec and consumes 2 μJ-4 μJ of energy. The experimental results in this paper include these time and energy overheads. PAST, EXP, and Proportional-Differential (PD) are used as PUPs of the proposed algorithm. PD is a traditional control-theoretic technique that estimates the change from the direction of the slope [9]. PAST assumes that the future CPU utilization will be the same as the previous one. Table 3 shows the formulas of PUP when PAST, PD and EXP are used as the PUPs of the proposed algorithm.
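The three predictors can be written as small functions. The sketch below follows the formulas listed in Table 3; the function and parameter names are assumptions made for illustration, and the PD gains are left as parameters so that the values printed in Table 3 (or any other values) can be supplied by the caller.

```python
def pup_past(u_curr):
    """PAST: the next utilization equals the current one."""
    return u_curr


def pup_exp(u_curr, prev_estimate, alpha=0.5):
    """EXP: weighted average of the current utilization and the
    estimate made for the previous interval (alpha = 0.5 in Table 3)."""
    return alpha * u_curr + (1.0 - alpha) * prev_estimate


def pup_pd(u_curr, u_prev, k_p, k_d):
    """PD: current utilization plus a proportional term on the error
    (1 - u) and a differential term on the change of that error."""
    error = 1.0 - u_curr
    prev_error = 1.0 - u_prev
    return u_curr + k_p * error + k_d * (error - prev_error)
```

Each function returns the raw prediction; the text does not say whether the result is clipped to a valid utilization range, so no clipping is applied here.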

Table 3. Formulas of PD, PAST and EXP used as per-Process Utilization Predictors (PUPs)

PUP  | Formula
-----+-------------------------------------------------------------------------------------------
PD   | $u_i^* = u_i + K_P (1 - u_i) + K_D ((1 - u_i) - (1 - u_i^-))$  ($K_P = -3$ and $K_D = -3$)
PAST | $u_i^* = u_i$
EXP  | $u_i^* = \alpha u_i + (1 - \alpha) u_i^{*-}$  ($\alpha = 0.5$)

[Figure: the ready queue sorted by scheduling priority. I: length of an interval; Q: length of time quantum.]

Fig. 3. Implementation of the proposed algorithm within Linux kernel

The proposed algorithm is implemented within Linux kernel version 2.5.58, as shown in Figure 3. The length of a time quantum in the Linux kernel is 10 ms (Q = 10). We added a new data structure, called the utilization history table, to keep track of the CPU utilization of each process. A process is allocated an entry in the table right after its creation and releases the entry before its termination. Each entry of the table contains three fields: valid-flag, curr-util and prev-util. valid-flag indicates whether the corresponding information is valid, curr-util holds the current utilization, and prev-util holds the utilization of the previous interval. CPU utilization information is updated in the timer interrupt handler, which updates the curr-util of the current process. This method has little computational overhead. More precise CPU utilization could be measured if it were updated every time a process changes its state. A new kernel process is created to carry out the proposed power management. It periodically monitors the state of each process and adjusts the CPU frequency based on the predicted CPU utilization.
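The utilization history table described above can be modeled in user space as follows. This is only a sketch of the bookkeeping, not kernel code: the class and method names are hypothetical, and rolling curr-util into prev-util at the end of each interval is an assumption about how the two fields are maintained.

```python
from dataclasses import dataclass


@dataclass
class HistoryEntry:
    valid_flag: bool = False
    curr_util: float = 0.0   # utilization accumulated in this interval
    prev_util: float = 0.0   # utilization of the previous interval


class UtilizationHistoryTable:
    def __init__(self, interval_ms=100, tick_ms=10):
        self.interval_ms = interval_ms
        self.tick_ms = tick_ms        # timer period, equal to Q in the text
        self.entries = {}             # pid -> HistoryEntry

    def on_create(self, pid):
        """Allocate an entry right after the process is created."""
        self.entries[pid] = HistoryEntry(valid_flag=True)

    def on_exit(self, pid):
        """Release the entry before the process terminates."""
        self.entries.pop(pid, None)

    def on_timer_tick(self, running_pid):
        """Charge one tick to the process that is currently running."""
        entry = self.entries.get(running_pid)
        if entry is not None and entry.valid_flag:
            entry.curr_util += self.tick_ms / self.interval_ms

    def on_interval_end(self):
        """Assumed policy: shift curr-util into prev-util and start over."""
        for entry in self.entries.values():
            entry.prev_util = entry.curr_util
            entry.curr_util = 0.0
```

Updating only on timer ticks keeps the cost per tick constant, which matches the remark that this method has little computational overhead.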

Table 4 shows the execution time of an application and the computing overhead when the proposed algorithm runs at different monitoring intervals. The computing overhead is the ratio of the execution time of the power management code to the total execution time. The interval length, denoted by I, is the period at which the proposed power management code is executed. With an extremely short monitoring interval, the computing overhead of the algorithm increases, but the prediction accuracy is high. For example, the 10 msec monitoring interval entails the 0.67% computing overhead shown in Table 4. By contrast, with an extremely long monitoring interval, the prediction accuracy becomes too low. For instance, Table 4 reveals that the 200 msec monitoring interval prolongs the execution time of the application by about 34 seconds, due to its inaccurate prediction. Because the number of processes on the READY queue at any given moment is generally less than $\lceil I/Q \rceil$, the estimated CPU utilization tends to be lower than the actual CPU utilization. The most desirable monitoring interval in Table 4 is 100 msec, which is used in the subsequent experiments (I = 100).

Table 4. The execution time of an application and the computing overhead when the proposed algorithm runs at different monitoring intervals

interval length (ms) | computing overhead (%) | execution time (sec)
---------------------+------------------------+---------------------
10                   | 0.67                   | 89.9
50                   | 0.12                   | 90.1
100                  | 0.06                   | 90.0
200                  | 0.02                   | 124.1

Three types of applications have been used to verify the efficiency of the proposed algorithm. The first type is a set of I/O intensive applications: Ubench4.0.1/fsdisk and fstime.¹ The second type is a set of interactive applications, XPDF and LLL-1.4.³ In order to provide an identical set of inputs to these applications, we logged the desired inputs using “Interactive Linux application Benchmark”² before carrying out the experiments. The third type is a set of multimedia applications: Mplayer (an MPEG player for Linux⁴) and Xmms (an MP3 player for Linux⁵). In order to make a fair comparison between the power management policies, an MPEG video clip and an MP3 file were deliberately chosen so that no frame drops occur under any policy.



3.2 Experimental Results 

In order to show the prediction accuracy improvement, the proposed algorithm was compared to PAST, EXP, and PD. In the rest of this section, we abbreviate the proposed algorithm by prefixing the name of its PUP with ‘P’, that is, PPAST, PEXP, and PPD. The experimental results of each application are appraised by considering the power saving gains, the performance impact and the computing overhead. The power consumption of an application that runs without any power management policy (NPM) is taken to be 1, so the consumed power is normalized by that of NPM. The computing overhead is calculated by taking the average of the execution times of the power management code over all intervals.

¹ http://www.tux.org/pub/tux/benchmarks/System/unixbench, Unix Benchmark

² http://opensource.nus.edu.sg/~ctk/benchmark/bench.html, Benchmarking of interactive Linux applications

³ http://users.pandora.be/thomas.raes/LSS/lss.html, Linux Lunar Lander

⁴ http://www.mplayerhq.hu/homepage/design6/info.html, Mplayer

⁵ http://www.xmms.org, XMMS

Table 5. Consumed power, execution time and computing overhead of power management policies during the execution of an I/O application

       |                  fsdisk                    |                  fstime
policy | consumed | execution     | computing       | consumed | execution     | computing
       | power    | time (second) | overhead (μsec) | power    | time (second) | overhead (μsec)
-------+----------+---------------+-----------------+----------+---------------+----------------
PD     | 0.59     | 130.6         | 41.67           | 0.64     | 131.1         | 33.34
PPD    | 0.50     | 131.0         | 63.82           | 0.54     | 130.1         | 52.76
EXP    | 0.77     | 131.2         | 18.49           | 0.84     | 129.8         | 15.26
PEXP   | 0.33     | 131.2         | 49.01           | 0.41     | 130.0         | 53.65
PAST   | 0.52     | 131.3         | 18.55           | 0.54     | 129.9         | 14.81
PPAST  | 0.33     | 132.7         | 45.51           | 0.38     | 130.0         | 52.31

Table 6. Consumed power, execution time and computing overhead of power management policies during the execution of an interactive application

       |                   xpdf                     |                   LLL
policy | consumed | execution     | computing       | consumed | execution     | computing
       | power    | time (second) | overhead (μsec) | power    | time (second) | overhead (μsec)
-------+----------+---------------+-----------------+----------+---------------+----------------
PD     | 0.64     | 115.7         | 31.22           | 0.69     | 23.8          | 70.39
PPD    | 0.56     | 119.4         | 46.09           | 0.51     | 24.4          | 96.98
EXP    | 0.83     | 111.1         | 13.93           | 0.89     | 25.5          | 25.87
PEXP   | 0.38     | 123.0         | 37.88           | 0.49     | 29.5          | 75.99
PAST   | 0.42     | 143.9         | 15.06           | 0.61     | 25.6          | 26.41
PPAST  | 0.62     | 115.1         | 37.24           | 0.39     | 26.9          | 91.23

As shown in Table 5, the performance of I/O applications can be assessed by their execution time. PPD, PEXP, and PPAST achieved 15%-17%, 51%-57% and 30%-37% power reductions over PD, EXP, and PAST, respectively. An I/O intensive process repeats the WAIT-READY-RUN cycle to handle I/O requests. By considering the current states of the processes, we can determine when an I/O intensive process is going to take up a portion of the CPU time. This reduces the power consumption of I/O applications by 15%-57% without delaying their execution.
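The reduction percentages quoted above follow directly from the normalized power columns of Table 5. The short sketch below recomputes them; the dictionary layout and function names are chosen for this example only, and the values are copied from Table 5.

```python
# Normalized consumed power from Table 5 (NPM = 1).
TABLE5_POWER = {
    "fsdisk": {"PD": 0.59, "PPD": 0.50, "EXP": 0.77,
               "PEXP": 0.33, "PAST": 0.52, "PPAST": 0.33},
    "fstime": {"PD": 0.64, "PPD": 0.54, "EXP": 0.84,
               "PEXP": 0.41, "PAST": 0.54, "PPAST": 0.38},
}


def reduction_percent(baseline, proposed):
    """Relative power reduction of the proposed policy, in percent."""
    return (baseline - proposed) / baseline * 100.0


def reductions(table):
    """Map (application, baseline policy) to the reduction achieved by
    the corresponding 'P'-prefixed policy."""
    result = {}
    for app, power in table.items():
        for base in ("PD", "EXP", "PAST"):
            result[(app, base)] = reduction_percent(power[base], power["P" + base])
    return result


if __name__ == "__main__":
    for (app, base), value in sorted(reductions(TABLE5_POWER).items()):
        print(f"{app:7s} P{base:5s} vs {base:5s}: {value:5.1f}%")
```

The same calculation applies to Tables 6 and 7 by substituting their power columns.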

Table 6 shows the experimental results when an interactive application, xpdf or LLL, generates the workload. In general, the proposed algorithm reduces power consumption by 12.5%-54%. However, in the case of xpdf, PPAST consumes more power than PAST. Note that PAST increases the execution time of xpdf by 32.8 seconds in the worst case compared to the other policies, which is a severe performance degradation. It seems that PAST predicts a lower CPU utilization than the actual CPU utilization.

Table 7. Consumed power and computing overhead of power management policies during the execution of a multimedia application

       |           mplayer          |            xmms
policy | consumed | computing       | consumed | computing
       | power    | overhead (μsec) | power    | overhead (μsec)
-------+----------+-----------------+----------+----------------
PD     | 0.56     | 57.28           | 0.43     | 25.33
PPD    | 0.50     | 83.57           | 0.36     | 49.70
EXP    | 0.81     | 25.61           | 0.65     | 34.60
PEXP   | 0.42     | 69.53           | 0.35     | 46.83
PAST   | 0.64     | 25.68           | 0.47     | 38.70
PPAST  | 0.41     | 71.80           | 0.37     | 50.92

Table 7 shows the experimental results when the chosen multimedia file is played without any frame drops. The proposed algorithm reduced the power consumption by 11%-48% over PAST, EXP, and PD. However, the power reduction rate of the multimedia applications is lower than that of the other applications. This gap arises because a multimedia application process stays in the READY or RUN state after its creation.

As shown in the experimental results of Tables 5-6, the proposed algorithm achieves large power reductions over existing interval-based approaches, such as PAST, EXP and PD. Although the computing overhead increases by up to 3.5 times in the worst case, each application runs without any delay in its execution time and the power consumption also decreases.



4 Conclusion and Future Work 

This paper proposed an efficient power management algorithm that improves prediction accuracy by using processes’ state information. The experiments revealed that the proposed algorithm reduces power consumption by 11%-57% compared to existing interval-based algorithms such as PD, EXP and PAST. However, the efficiency of the proposed algorithm depends on the length of the monitoring interval. Choosing the optimal interval length for each type of application remains future work.
Abstract. Several techniques have been proposed for high performance microprocessor design: the value predictor, which predicts the result of an instruction before its actual execution in order to exceed the data flow limit; the redundant operation table, which removes redundant computation dynamically; and the asynchronous bus, which avoids the clock synchronization problem. However, these techniques increase area cost and power consumption because of the large tables required by the value predictor and the redundant operation table, and because of the high switching activity of the asynchronous bus. To address the data tables of the value predictor and the redundant operation table, we have investigated the partial tag and narrow-width operand methods, which were recently proposed separately, and present an efficient update method for the value predictor and a table organization method for the redundant operation table, respectively. To reduce the excessive switching activity of the asynchronous bus, we also propose a bus encoding method using a frequent value cache, which reduces repeated transmissions of the same data. The three proposed methods (an efficient update method for the value predictor, a table organization method for the redundant operation table, and a frequent value cache for the asynchronous bus) exploit information locality, such as instruction and data locality, as well as data redundancy. Analysis with a conventional microprocessor model shows that the three proposed methods reduce total area cost and power consumption by about 18.2% and 26.5%, respectively, with negligible performance variance.



P.-C. Yew and J. Xue (Eds.): ACSAC 2004, LNCS 3189, pp. 170-184, 2004. 
1 Introduction

Until a few years ago, performance improvement was the key research issue in microprocessor design. Recently, however, the area cost and the power consumption of microprocessors have increased drastically as the number of transistors keeps growing. As a result, research interest has shifted to improving performance while keeping area cost and power consumption efficient. In this paper, we investigate several design methods for a high performance microprocessor with an emphasis on achieving efficient area cost and power consumption.

Among the many design techniques for a high performance microprocessor, this research investigates three: the value predictor, the redundant operation table, and the asynchronous dual-rail bus. The value predictor predicts the result of an instruction before the instruction is actually executed. Hence, dependent instructions can be executed at the same time as the instruction itself. The redundant operation table stores recently executed instructions and checks whether the instruction about to be executed is already stored in the table. In other words, the redundant operation table can skip the real execution of an instruction through a simple table lookup, which shortens the execution time of the instruction. The asynchronous dual-rail bus is a reliable bus scheme for a complex system such as a future high performance microprocessor. It transmits data reliably by using dual-rail encoding, which combines the data and the control signals.

We analyzed these three design methods from the area cost and power consumption points of view, looking in particular for locality and redundancy in the data that each method uses. We found several kinds of information locality and data redundancy that cause extra area cost and power consumption. More specifically, the value predictor and the redundant operation table store the same or only slightly different instructions (instruction locality), small operand values (operand data locality), and small result values (result data locality), whereas the asynchronous dual-rail bus transmits the same data items repeatedly (communication data locality). From these observations, we concluded that each design method can be further enhanced for lower area cost and lower power consumption by exploiting such localities to reduce redundancy.

In this paper, we propose three enhanced methods. First, for value predictors, we propose a method that combines two previously proposed area cost reduction methods, the partial-tag and narrow-width methods. Second, we design a partial resolution method to reduce the area cost of the tag fields in the redundant operation table. Third, we apply the previously proposed frequent value cache method to an asynchronous dual-rail bus to minimize the communication data redundancy.

As the last step, we investigated the total area cost and power consumption reduction in a conventional microprocessor model. With the proposed methods, the total area cost and power consumption in the microprocessor model are reduced by about 18.2% and 26.5%, respectively.

This paper is organized as follows. Section 2 describes related work on the three high performance design methods, information locality, and data redundancy. The proposed area and power reduction methods for the value predictor and the redundant operation table are described in Sections 3 and 4, respectively. The power reduction method designed for the asynchronous dual-rail bus is explained in Section 5. The total area cost and power consumption reduction in a microprocessor model is analyzed in Section 6. Section 7 concludes this research.



2 Related Work 

2.1 High Performance Design Methods 

Value Predictor: Value predictors have been proposed to overcome the data 
dependency problems in the instruction-level parallelism by predicting a result 
value of an instruction before its actual execution [1], [2]. 

Redundant Operation Table: When the instruction-level parallelism in- 
creases, there are many side effects. One of the side effects is the increased 
number of redundant executions because of speculative executions due to branch 
predictor or value predictor. Unfortunately, speculative or redundant operations 
limit the performance improvement and increase the power consumption as well 
[3]. To overcome such negative effects, many optimization methods have been 
proposed [4], [5]. One typical solution is eliminating redundant operations, where 
redundant executions of complex operations are replaced by simple table lookup 
operations [6]. 

Asynchronous Dual-Rail Bus: Because of the steady increase in the number of components on a chip, SOC design methods have been studied intensively and will be used for future high performance microprocessors. To succeed in the market, the time-to-market and the reliability of an SOC are very important. To shorten the design time and improve the reliability of SOCs, asynchronous design methods [7] have been studied recently. For a reliable asynchronous bus structure in SOC designs, the dual-rail data encoding method [8] has been intensively investigated.



2.2 Information Locality and Data Redundancy 

Information Locality: We define four kinds of information locality in a microprocessor, related to instructions, operands of instructions, results of instructions, and communication data over a bus. First, instruction locality means that a small number of instructions are executed repeatedly or frequently, and that these instructions are usually located close to each other in a given time interval. Second, operand locality means that the operand values of most instructions are small and can be represented with a small number of bits. Third, result locality means that the results of most instructions are small and can be represented with a small number of bits. Last, communication data locality means that a bus transmits the same or very similar data repeatedly or frequently in a given time interval.

Data Redundancy: From the above information localities, we can infer that there is data redundancy in instructions, operands of instructions, results of instructions, and communication data over a bus. First, data redundancy of instructions occurs when the instruction addresses in a given time interval differ only slightly, which implies that the higher bits of the instruction addresses are the same. Hence the higher bits of the addresses of executed instructions in a given time interval are redundant. Second, data redundancy of operands occurs when most operands of executed instructions in a given time interval require a small number of bits, and hence the higher bits of the operands are redundant. Third, data redundancy of results occurs when most results of executed instructions in a given time interval require a small number of bits, and hence the higher bits of the results are redundant. Last, data redundancy of communication data occurs when most of the communication data in a given time interval are the same or similar, and hence most communications are redundant.
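The notion of redundant higher bits can be made concrete with a few lines of Python. The helpers below are illustrative only: their names, the 32-bit word size, and the restriction to non-negative integers are assumptions of this sketch, not definitions from the paper.

```python
def significant_bits(value):
    """Number of bits needed to represent a non-negative integer."""
    return max(1, value.bit_length())


def redundant_high_bits(value, word_bits=32):
    """Leading zero bits of a value stored in a word of word_bits bits."""
    return word_bits - significant_bits(value)


def shared_high_bits(addresses, word_bits=32):
    """Number of high-order bits that are identical in all addresses."""
    first = addresses[0]
    diff = 0
    for address in addresses[1:]:
        diff |= first ^ address   # collect every bit position that differs
    return word_bits - diff.bit_length()
```

For instance, the operand 5 needs only 3 bits, so 29 of the 32 bits are redundant, and three instruction addresses that all lie within one 256-byte region share their upper 24 bits.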

3 Value Predictor 

3.1 Table Structure 

In this research, we explain only the stride predictor for simplicity. The stride predictor assumes that consecutive result values of an instruction differ by the same stride value [1]. Usually, a value predictor uses a large data table to store the required information, and the table is indexed by the instruction address.
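A stride predictor of the kind described above can be sketched as a direct-mapped table indexed by the instruction address. The class below is a simplified software model: the entry layout (tag, last value, stride), the table size, and the assumption of 4-byte aligned instructions are choices made for this example, not details from the paper.

```python
class StridePredictor:
    def __init__(self, num_entries=4096):
        self.num_entries = num_entries
        # Each entry is None or a tuple (tag, last_value, stride).
        self.table = [None] * num_entries

    def _split(self, pc):
        word = pc >> 2  # assumption: instructions are 4-byte aligned
        return word % self.num_entries, word // self.num_entries

    def predict(self, pc):
        """Return last value + stride, or None if there is no matching entry."""
        index, tag = self._split(pc)
        entry = self.table[index]
        if entry is None or entry[0] != tag:
            return None
        _, last_value, stride = entry
        return last_value + stride

    def update(self, pc, actual):
        """Record the actual result and the stride from the previous result."""
        index, tag = self._split(pc)
        entry = self.table[index]
        if entry is not None and entry[0] == tag:
            stride = actual - entry[1]
        else:
            stride = 0
        self.table[index] = (tag, actual, stride)
```

After an instruction has produced 10 and then 14, the model predicts 18 for its next execution, because the observed stride is 4.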

3.2 Combining Partial Tag and Narrow-Width Operand Method 

Two Methods to Reduce Area and Power of Value Predictor: To reduce 
the area cost and power consumption of value predictor, two methods have been 
already proposed as follows. 

Partial Tag Method: Instruction and data caches rely on an exact association between an address and an indexed entry, because the data returned by a lookup must be the same as the previously stored value. In a value predictor, however, the data returned by a lookup is only a prediction, so an exact association between the lookup address and the indexed data is not always required. Because of this loose association, a value predictor does not necessarily need a full tag and can use a partial tag instead, which reduces the area cost of the tag part [9]. Briefly, the full-tag method takes all address bits except the index bits as the tag, whereas the partial-tag method uses only some part of the full tag.
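The difference between a full tag and a partial tag can be shown with a small address-splitting helper. This sketch assumes 4-byte aligned instructions and takes the low-order bits of the full tag as the partial tag; which tag bits are actually kept is not specified in the text, so that choice is an assumption.

```python
def split_address(pc, index_bits, partial_tag_bits=None):
    """Split an instruction address into (index, tag).

    With partial_tag_bits=None the full tag is returned; otherwise only
    the lowest partial_tag_bits bits of the full tag are kept.
    """
    word = pc >> 2                               # drop the byte offset
    index = word & ((1 << index_bits) - 1)
    full_tag = word >> index_bits
    if partial_tag_bits is None:
        return index, full_tag
    return index, full_tag & ((1 << partial_tag_bits) - 1)
```

Two addresses that differ only in their upper bits, such as 0x00001040 and 0x01001040, have different full tags but the same 4-bit partial tag, so a partial-tag table treats them as the same instruction. A value predictor can tolerate this aliasing because a wrong match only yields a wrong prediction.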

Narrow-Width Operand Method: Analysis of the result values of a program shows that only a few result values require the full precision supported by processor registers. Taking this locality into account, the narrow-width operand method classifies result values into two types, narrow-width and wide-width result values, according to the required number of bits [10]. To reduce the area cost of data tables, the narrow-width operand method uses a narrow-width table and a wide-width table to store the narrow-width and wide-width result values, respectively. If a result value of an instruction requires fewer bits than a predetermined number of bits, the prediction information of the instruction is stored in the narrow-width table. Otherwise, the prediction information is stored in the wide-width table. Because the narrow-width table stores fewer bits for each result value, it reduces the overall area cost of a data table.

Fig. 1. Combining Method of Partial-Tag and Narrow-Width Methods

Combining Partial Tag and Narrow-Width Operand Methods: To date, the two area cost reduction methods for value predictors have been proposed independently. In the present research, a combining method with an efficient table-update method is proposed to minimize the performance degradation. A simple superposition of the two methods is conceivable. However, such a simple superposition decreases the performance improvement ratio because two prediction values are generated from the two tables.

We propose a new table-update method, shown in Figure 1. When the result of an instruction is classified as a narrow-width (wide-width) result, the instruction is stored in the narrow-width (wide-width) table. At the same time, the wide-width (narrow-width) table invalidates the indexed entry if that entry contains the same partial tag as the instruction. In short, depending on the classification of an instruction’s result, only one of the two tables stores the instruction, and the other table must invalidate the corresponding entry if its tag matches that of the referenced address.
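The update rule can be sketched with two small tables. This model is a simplification made for illustration: each entry holds only a partial tag and the last result (no stride), the tables are dictionaries keyed by index, and the 8-bit narrow width, the index and tag widths, and all names are assumptions.

```python
NARROW_BITS = 8  # assumed width of a narrow result (signed)


def is_narrow(value, bits=NARROW_BITS):
    """True if value fits in a signed field of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


class CombinedTables:
    def __init__(self, index_bits=8, partial_tag_bits=4):
        self.index_bits = index_bits
        self.partial_tag_bits = partial_tag_bits
        self.narrow = {}  # index -> (partial_tag, value)
        self.wide = {}    # index -> (partial_tag, value)

    def _split(self, pc):
        word = pc >> 2
        index = word & ((1 << self.index_bits) - 1)
        tag = (word >> self.index_bits) & ((1 << self.partial_tag_bits) - 1)
        return index, tag

    def update(self, pc, result):
        index, tag = self._split(pc)
        if is_narrow(result):
            target, other = self.narrow, self.wide
        else:
            target, other = self.wide, self.narrow
        target[index] = (tag, result)
        # Invalidate a stale entry for the same instruction in the other table.
        stale = other.get(index)
        if stale is not None and stale[0] == tag:
            del other[index]

    def lookup(self, pc):
        index, tag = self._split(pc)
        for table in (self.narrow, self.wide):
            entry = table.get(index)
            if entry is not None and entry[0] == tag:
                return entry[1]
        return None
```

Because the other table is invalidated on every update, at most one table holds a matching entry for an instruction, so a lookup never has to choose between two predictions.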



3.3 Analysis 

To measure the effect of the proposed area reduction method, the die size and the power consumption of the value predictor are measured by using CACTI 3.2 [11]. We also investigated the IPC value when the proposed method is used with a SimpleScalar [12] model and SPEC 95 [13] benchmark programs. However, as we expected, the IPC value changes very little, by about 1%. Hence we skip the explanation of the IPC variation when the proposed method is used.

Area Cost: Table 1 describes the area cost reduction ratios over the conventional stride predictor. The reduction of area cost is higher with the narrow-width method than with the partial-tag method. The reason is as follows. The reduction ratio depends on the portion of the area cost that each method can reduce. The partial-tag method can decrease the area cost of the tag part only, whereas the narrow-width method can decrease the area cost of all result values. Meanwhile, the proposed combining method decreases the area cost more than the other area cost reduction methods, by about 71% for the stride predictor.

Table 1. Area Cost and Power Consumption Reduction Ratios with Different Area Reduction Methods for Stride Predictor

                          | Area Cost, Number of Entries (K)                | Power Consumption, Number of Entries (K)
Area Reduction Methods    | 32          | 16   | 8           | 4            | 32          | 16        | 8           | 4
--------------------------+-------------+------+-------------+--------------+-------------+-----------+-------------+----------
Narrow-Width              | 57%         | 55%  | 56%         | 47%          | 16%         | 17%       | 27%         | 24%
Partial-Tag               | 20%         | 19%  | [illegible] | 20%          | 39%         | [missing] | 46%         | [missing]
Proposed Combining Method | 72%         | 74%  | 72%         | [missing]    | [illegible] | 51%       | [illegible] | 68%

Power Consumption: Table 1 also describes the power consumption reduction ratios over the conventional stride predictor. The reduction of power consumption is higher with the partial-tag method than with the narrow-width method. The reason is that the tag part consumes more power than the data part, since each tag comparison is costly in power. Meanwhile, the proposed combining method decreases the power consumption more than the other area cost reduction methods, by about 61% for the stride predictor.



4 Redundant Operation Table 

4.1 Table Structure 

In a redundant operation table, operands are partitioned into two parts: an index part and a tag part. Meanwhile, all operations are classified into integer or floating-point operations. Hence redundant operation tables have different structures depending on the operation type.

4.2 Narrow-Wide-Width Table 

A preliminary analysis of operands for integer and floating-point operations in a SimpleScalar [12] microprocessor with SPEC [13] benchmarks reveals that most operands can be represented with a small number of bits. A partial resolution method is proposed to exploit this characteristic. The partial resolution method eliminates the area cost of storing redundant bits, namely the consecutive 0s in the higher bits of integer operands and in the lower bits of floating-point operands, which the conventional wide-width redundant operation table stores. A wide-narrow-width redundant operation table utilizing the partial resolution method is designed as shown in Figure 2. The wide-narrow-width redundant operation table dynamically classifies operations into wide-width and narrow-width operations depending on the operand bit width. When the operation requires narrow-width operands, the instruction is stored in the narrow-width redundant operation table. Otherwise, the instruction is stored in the wide-width redundant operation table. Note that the concept of the partial resolution method is similar to the partial-tag method [9], which was proposed for value predictors. The partial-tag method for value predictors stores imprecise tag information, but the partial resolution method for the redundant operation table must store precise tag information. Hence, the partial-tag method for value predictors cannot be directly used for the redundant operation table.

Fig. 2. Wide-Narrow-Width Redundant Operation Table
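The classification step and the exact-match requirement can be sketched as follows. The model covers only unsigned integer operands, uses unbounded dictionaries in place of fixed-size hardware tables, and assumes a 16-bit narrow operand width; these simplifications and all names are assumptions of this example.

```python
NARROW_BITS = 16  # assumed operand width of the narrow table


def is_narrow_int(operand, narrow_bits=NARROW_BITS):
    """True if every bit above narrow_bits is 0 (unsigned operand)."""
    return 0 <= operand < (1 << narrow_bits)


class RedundantOperationTable:
    def __init__(self):
        self.narrow = {}  # (opcode, a, b) -> result, short operand fields
        self.wide = {}    # (opcode, a, b) -> result, full-width fields

    def _table_for(self, a, b):
        if is_narrow_int(a) and is_narrow_int(b):
            return self.narrow
        return self.wide

    def lookup(self, opcode, a, b):
        # Keys are compared exactly, so a hit always returns a correct result.
        return self._table_for(a, b).get((opcode, a, b))

    def insert(self, opcode, a, b, result):
        self._table_for(a, b)[(opcode, a, b)] = result


def execute(table, opcode, a, b, compute):
    """Return (result, reused); reused is True when execution was skipped."""
    cached = table.lookup(opcode, a, b)
    if cached is not None:
        return cached, True
    result = compute(a, b)
    table.insert(opcode, a, b, result)
    return result, False
```

Unlike the value predictor, a hit here replaces the real execution, which is why the stored operands must match exactly and an imprecise partial tag cannot be used.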



4.3 Analysis 

We also measured the IPC when the proposed method is used with
a SimpleScalar [12] model and SPEC 95 [13] benchmark programs. However,
because the IPC variation is negligible, we omit a detailed discussion of it.

Area Cost: The area cost of the conventional wide-width redundant operation
table can be calculated easily. The wide-narrow-width redundant operation
table consists of two subsidiary tables, hence its area cost is the sum of the
area costs of the narrow-width and wide-width tables. Based on these
considerations, the relative area cost is measured as shown in Table 2. Only
models containing 512 or more entries are measured, since the redundant
operation table usually requires many entries. As the table shows, the proposed
partial resolution method reduces the area cost by up to 20% (FP, 2048 entries).

Table 2. Relative Area Cost and Power Consumption Reduction Ratio
(Reduction Ratio over Wide-Width Table)

                             Number of Entries
                             2048           512
Area Cost            INT     7%             9%
                     FP      20%            (illegible)
Power Consumption    INT     34%            24%
                     FP      (illegible)    31%

Fig. 3. Frequent Value Cache augmented Bus Scheme

Power Consumption: Since the conventional wide-width redundant operation
table is accessed on every lookup, its dynamic power consumption can be
calculated easily. In contrast, each subsidiary table in the wide-narrow-width
redundant operation table is accessed with a different lookup ratio, so the
lookup ratio of each table must be considered. Hence the total dynamic power
consumption of the proposed wide-narrow-width redundant operation table is the
sum of the power consumptions of the narrow-width and wide-width tables, each
weighted by its lookup ratio. Based on these considerations, the relative
dynamic power consumption reduction ratio is measured as shown in Table 2. As
the table shows, the proposed partial resolution method reduces the dynamic
power consumption by up to about 34% (INT, 2048 entries).
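
The lookup-ratio weighting described for the wide-narrow-width table can be written down as a short sketch. The function name and its parameters are hypothetical, and the per-lookup energies would have to come from a tool such as CACTI; no measured values from the paper are used here.

```python
def dynamic_power_reduction(narrow_ratio: float,
                            narrow_energy: float,
                            wide_energy: float,
                            conventional_energy: float) -> float:
    """Reduction of dynamic energy per lookup relative to the
    conventional wide-width table.

    narrow_ratio        -- fraction of lookups served by the narrow table
    narrow_energy       -- energy per lookup of the narrow sub-table
    wide_energy         -- energy per lookup of the wide sub-table
    conventional_energy -- energy per lookup of the conventional table
    """
    if not 0.0 <= narrow_ratio <= 1.0:
        raise ValueError("narrow_ratio must be within [0, 1]")
    # Each sub-table contributes in proportion to how often it is accessed.
    split = narrow_ratio * narrow_energy + (1.0 - narrow_ratio) * wide_energy
    return 1.0 - split / conventional_energy
```

With purely illustrative inputs (half of the lookups narrow, the narrow table costing 0.4 of a wide lookup), the function returns a reduction of 0.3.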

5 Asynchronous Dual-Rail Bus 

5.1 Frequent Value Cache 

The one-of-four data encoding method reduces the power consumption of the
dual-rail encoding method by decreasing switching activity [14]. Meanwhile, data
pattern analysis shows that many data items are transmitted repeatedly, in
accordance with the results in [15]. Hence we can conclude that both the
conventional dual-rail and the previously proposed one-of-four data encoding
methods waste power when the data bus transmits the same data items repeatedly.

To reduce this waste of power, we propose a different method, which utilizes
a buffer to exploit the repeated transmission of data items. The proposed
buffer stores data items and sends an index instead of a data item when the
data item to be sent is already stored in the buffer. Since the index requires
fewer bits than the data itself, the wasted bandwidth and the switching activity
decrease, resulting in lower power consumption.
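
The buffering idea can be sketched as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, the round-robin replacement policy is an assumption (the text does not specify one), and the boolean flag stands in for the control signal that tells the receiver whether an index or a data item was sent. The sender and receiver each keep their own table and update it in the same way, so the indices stay consistent.

```python
class FrequentValueCache:
    """Small fully associative table with round-robin replacement (assumed)."""

    def __init__(self, entries: int):
        self.slots = [None] * entries
        self.next_victim = 0

    def find(self, value):
        """Return the slot index holding value, or None on a miss."""
        return self.slots.index(value) if value in self.slots else None

    def update(self, value):
        """Store value in the next victim slot."""
        self.slots[self.next_victim] = value
        self.next_victim = (self.next_victim + 1) % len(self.slots)


def send(fvc: FrequentValueCache, value: int):
    """Return (is_index, payload) as it would appear on the bus."""
    idx = fvc.find(value)
    if idx is not None:
        return True, idx       # hit: only the index is transmitted
    fvc.update(value)
    return False, value        # miss: the full data item is transmitted


def receive(fvc: FrequentValueCache, is_index: bool, payload: int) -> int:
    """Recover the data item on the receiver side."""
    if is_index:
        return fvc.slots[payload]
    fvc.update(payload)
    return payload
```

Because both tables see the same sequence of misses, a receiver that mirrors the sender's updates always decodes an index to the value the sender intended.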

Figure 3 briefly illustrates a frequent value cache (FVC), which stores the data
items of each communication. The normal sender and receiver deliver a data
item in the normal fashion, while the Comp and Decmp blocks deliver either the
data item itself or its FVC index, depending on whether the FVC hits. When the
data item itself is transferred, all bus lines are used; when an index of the
data item is transferred, only the index lines are used. Thus, the index lines
carry either an index or part of a data item. A control signal distinguishes
whether the transmitted information represents an index or a normal data item.

5.2 Analysis

Three measures are investigated: hit ratio, switching activity reduction ratio,
and power consumption reduction ratio. The hit ratio is the most important
one since it decides the switching activity reduction ratio, which in turn
determines the power consumption reduction ratio. For the analysis, we
investigated a memory bus in the SimpleScalar model [12] running SPEC95
benchmark [13] programs.

Hit Ratio: Investigating the data patterns on this memory bus led to two
conclusions. First, even an FVC with only one entry can detect
40% of the repeatedly transmitted data items. Second, an FVC with more than 256
entries can represent most data items.

Switching Activity: Given the high hit ratio of the FVC, we need to know
how much switching activity can be reduced. In this research, only the changes
of signal level between consecutive data items are measured to calculate the
switching activity ratio of a bus. The normal dual-rail bus uses all 32
logical bits and each signal line causes two switchings, hence the switching
activity is 32 x 2. With the FVC, an index is delivered on a hit and a normal
data item on a miss. In addition, the control signal changes twice for each
communication. Therefore, the switching activity when the FVC is used is
given by Equation 1.

$$P_{hit} \times \{1 + \log_2(\#entry)\} \times 2 + (1 - P_{hit}) \times (1 + 32) \times 2 \qquad (1)$$

Based on the above analysis, the switching activity reduction ratio of FVC 
over the normal dual-rail bus model is calculated by Equation 2. 



$$\frac{P_{hit} \times \{1 + \log_2(\#entry)\} \times 2 + (1 - P_{hit}) \times (1 + 32) \times 2}{32 \times 2} \qquad (2)$$

The analysis shows that the FVC reduces the switching activity of the
conventional model by up to 75%. However, beyond the number of entries at which
this maximum is reached, the reduction diminishes because of the increased
number of index bits.
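
Equations 1 and 2 can be evaluated with a small sketch to see the trade-off between hit ratio and index width. The function names are hypothetical, the logarithm is assumed to be base 2 (the number of index bits), and the hit ratios passed in are illustrative inputs rather than the measured values of the paper.

```python
import math

DATA_BITS = 32  # logical width of the bus, as stated in the text


def fvc_switching(p_hit: float, entries: int) -> float:
    """Equation 1: expected switchings per transfer with an FVC.

    The leading 1 in each term accounts for the control signal, and the
    factor 2 for the two switchings per dual-rail line.
    """
    index_bits = math.log2(entries)  # assumes entries is a power of two
    on_hit = (1 + index_bits) * 2
    on_miss = (1 + DATA_BITS) * 2
    return p_hit * on_hit + (1 - p_hit) * on_miss


def switching_ratio(p_hit: float, entries: int) -> float:
    """Equation 2: FVC switching relative to the normal bus (32 x 2)."""
    return fvc_switching(p_hit, entries) / (DATA_BITS * 2)
```

A ratio above 1 is possible when the hit ratio is very low, because the control signal adds switching on every transfer; growing the table raises the hit ratio but also the number of index bits per hit.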

Power Consumption: The total power consumption should include the power
consumption of the FVC tables, although the share of the tables
would be below 5% as explained in [16]. In addition, the power consumption
of the bus itself must be considered. To measure the power consumption
of the FVC table and the bus lines, we assume that a 0.25 micron technology is
used and that the length of the bus line is 10 mm, following the 2001 ITRS [17].
The power consumption of the normal model and of the FVC model is as follows:

Normal Model: Power is consumed only by the dual-rail bus for the
logical 32-bit bus lines. Based on the 0.25 micron technology, we assume that
the 10 mm bus lines consume about 0.4 nJ, as estimated with power measurement
tools.

FVC Model: Power is consumed by two parts: the FVC table
and the bus lines. To measure the power consumption of the FVC table, the CACTI
tool [11] is used. Since all entries must be checked at the same time, we
assume that the table is a fully associative content-addressable memory. The
power consumption of the FVC model can be formulated as Equation 3. Specifically,
the power consumption of the FVC table is multiplied by two because the FVC
model requires two FVC tables, one for the sender and one for the receiver.
Moreover, when the FVC misses, each FVC table must be updated, which consumes
more power. To account for this update power, we include the miss ratio in
the equation.

$$Table\_Power \times 2 \times (1 + Miss\_Ratio) + Bus\_Power \times Switching\_Activity\_Reduction\_Ratio \qquad (3)$$

Finally, the power consumption reduction ratio of the FVC
model over the normal model can be derived as shown in Equation 4.

$$\frac{Table\_Power \times 2 \times (1 + Miss\_Ratio) + (0.4\,nJ) \times Switching\_Activity\_Reduction\_Ratio}{0.4\,nJ} \qquad (4)$$

Fig. 4. Power Consumption Variation

Figure 4 shows the power consumption reduction ratio when the FVC model
is used. From the figure, it can be concluded that the FVC reduces the total
power consumption by up to about 14% and 22% for integer and floating-point
benchmarks, respectively.
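
Equations 3 and 4 translate directly into code. The sketch below uses hypothetical function names; only the 0.4 nJ bus figure comes from the text, and the table energy, miss ratio, and switching ratio used in the usage note are illustrative placeholders, not CACTI results.

```python
BUS_ENERGY_NJ = 0.4  # normal 32-bit dual-rail bus, as assumed in the text


def fvc_model_energy(table_energy_nj: float,
                     miss_ratio: float,
                     switching_ratio: float) -> float:
    """Equation 3: energy of the FVC model.

    Two tables (sender and receiver) are accessed, and each miss adds
    an update, hence the factor 2 * (1 + miss_ratio).
    """
    tables = table_energy_nj * 2 * (1 + miss_ratio)
    bus = BUS_ENERGY_NJ * switching_ratio
    return tables + bus


def fvc_over_normal(table_energy_nj: float,
                    miss_ratio: float,
                    switching_ratio: float) -> float:
    """Equation 4: FVC model energy relative to the normal model."""
    return fvc_model_energy(table_energy_nj, miss_ratio,
                            switching_ratio) / BUS_ENERGY_NJ
```

For instance, a placeholder table energy of 0.01 nJ with a miss ratio of 0.5 and a switching ratio of 0.5 gives a relative energy of 0.575, which shows how the table overhead eats into the saving obtained on the bus lines.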

6 Analysis in a Microprocessor 

In the previous sections, we analyzed independently the area cost and/or
power consumption reduction ratios of the proposed methods for the value
predictor, redundant operation table, and asynchronous dual-rail bus. Because
our main goal is to reduce the total area cost and power consumption of a high
performance microprocessor, we need to know how much area cost and power
consumption can be saved in the whole processor when the proposed methods are
applied to each design method.

6.1 Area Cost and Power Consumption Breakdowns 

Because no processor has been implemented with a value predictor, a redundant
operation table, and an asynchronous dual-rail bus at the same time, we need
to model a futuristic microprocessor to investigate the share of area cost and
power consumption of each design method.

Conventional Model: The Alpha 21264 microprocessor [18], [19] is selected
to find the breakdown of die size and power consumption of major blocks such
as the cache and core parts. Because the value predictor and redundant operation
table have a structure similar to that of a cache, we assume that the area cost
and power consumption of the tables for the value predictor and redundant
operation table can be calculated relative to those of the cache.

Area Cost and Power Consumption Breakdown: The Alpha 21264 uses 128 Kbyte
of instruction and data caches, which occupy about 30% of the total area [18]
and consume about 15% of the total power [19]. Figures 5(a) and 5(b)
show the breakdowns of area cost and power consumption of the Alpha 21264
model, respectively.

Fig. 5. Area Cost and Power Consumption Breakdowns of Alpha 21264 Model

New Model: The new Alpha 21264 model consists of the old Alpha 21264 and
the three design methods. Because of this modification of the old Alpha 21264
model, the area cost and power consumption breakdowns change.

Area Cost Breakdown: In the old Alpha 21264 processor, the caches account for
about 30% of the area cost and the other parts for about 70%. The value
predictor and redundant operation table add 164 Kbyte and 144 Kbyte of storage,
respectively. Since the 128 Kbyte cache uses about 30% of the total die size,
we infer that the value predictor increases the total area cost by about 38.4%
(30% * 164/128) and that the redundant operation table adds about 33.8%
(30% * 144/128). The total area cost therefore increases by about 72.2%, the
sum of the extra area costs of the value predictor and redundant operation
table. Renormalizing to this larger total, the share of each component becomes
17.4% for the cache, 22.3% for the value predictor, 19.6% for the redundant
operation table, and 40.7% for the others, as shown in Table 3. The combined
share of the value predictor and redundant operation table is large, about 42%.

Table 3. Area Cost Breakdown, Reduction Ratios, Relative Reduction in the new
Alpha 21264

Parts                        Portion   Reduction Ratio   Relative Reduction
Cache                        17.4%     -                 -
CPU Core                     40.7%     -                 -
Value Predictor              22.3%     64%               14.3%
Redundant Operation Table    19.6%     20%               3.9%
Total                        100%      -                 18.2%

Power Consumption Breakdown: The additional power consumption of the value
predictor and redundant operation table can likewise be calculated relative to
the power consumption of the cache. From [4], we infer that the stride-type
value predictor consumes five times as much energy as a cache of the same size.
Hence, the value predictor adds about 96.1% to the total power consumption
(15% * (164/128) * 5), and the redundant operation table adds about 16.9%
(15% * (144/128)). Renormalizing to this increased total, the share of each
block becomes 7.1% for the cache, 2.3% for the bus, 45.1% for the value
predictor, 7.9% for the redundant operation table, and 37.6% for the CPU core,
as shown in Table 4. The extra power consumption caused by the value predictor,
redundant operation table, and asynchronous dual-rail bus is very large, about
55%.

Table 4. Power Consumption Breakdown, Reduction Ratios, Relative Reduction in
the new Alpha 21264

Parts                        Portion   Reduction Ratio   Relative Reduction
Cache                        7.1%      -                 -
CPU Core                     37.6%     -                 -
Bus                          2.3%      14%               0.3%
Value Predictor              45.1%     52%               23.5%
Redundant Operation Table    7.9%      34%               2.7%
Total                        100%      -                 26.5%
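
The renormalization behind the area and power breakdowns can be reproduced with a few lines of arithmetic. The sketch below is illustrative: the helper name is hypothetical, and the 5% bus and 80% core power shares of the old model are assumptions back-derived from the published shares, since the text states only the 15% cache figure. The results agree with the shares quoted in the text to within about 0.1 percentage point.

```python
CACHE_KB = 128  # cache size of the old Alpha 21264 model


def renormalize(base: dict, added: dict) -> dict:
    """Express every block as a percentage of the enlarged total."""
    parts = {**base, **added}
    total = sum(parts.values())
    return {name: 100.0 * share / total for name, share in parts.items()}


# Area: cache 30%, everything else 70% (from the text).
area = renormalize(
    {"cache": 30.0, "core": 70.0},
    {"value_predictor": 30.0 * 164 / CACHE_KB,
     "redundant_op_table": 30.0 * 144 / CACHE_KB},
)

# Power: cache 15% (from the text); bus 5% and core 80% are assumptions.
power = renormalize(
    {"cache": 15.0, "bus": 5.0, "core": 80.0},
    {"value_predictor": 15.0 * (164 / CACHE_KB) * 5,
     "redundant_op_table": 15.0 * (144 / CACHE_KB)},
)

if __name__ == "__main__":
    for label, shares in (("area", area), ("power", power)):
        for name, pct in shares.items():
            print(f"{label:5s} {name:20s} {pct:5.1f}%")
```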

6.2 Reduction of Total Area Cost and Power Consumption 

Reduction of Total Area Cost and Power Consumption: 

Area Cost Reduction: When the proposed area cost reduction methods for value 
predictor and redundant operation table are used, the total area cost can be 
reduced by about 18.2%, which is calculated by the summation of reduction 
ratios of area cost for value predictor (22.3% * 64% = 14.3%) and redundant 
operation table (19.6% * 20% = 3.9%), as shown in Table 3. 

Power Consumption Reduction: The proposed power consumption
reduction methods decrease the power consumption of each design method
by about 52% for the value predictor, 34% for the redundant operation table, and
14% for the asynchronous dual-rail bus, as shown in Table 4. Weighted by the
share of each block, these reductions lower the total power consumption by
about 23.45% (= 45.1% * 52%), 2.7% (= 7.9% * 34%), and
0.3% (= 2.3% * 14%), respectively, for a total reduction of
about 26.5%, as shown in Table 4.
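
The totals in Tables 3 and 4 are share-weighted sums, which the following sketch recomputes from the published portions and reduction ratios. The function and variable names are hypothetical; all numbers are taken from the tables.

```python
def total_reduction(blocks: dict) -> float:
    """Sum of portion * reduction ratio over all blocks, in percent."""
    return sum(portion * ratio / 100.0 for portion, ratio in blocks.values())


# (portion of total in %, reduction ratio of the block in %)
area_blocks = {
    "value_predictor": (22.3, 64),
    "redundant_op_table": (19.6, 20),
}
power_blocks = {
    "value_predictor": (45.1, 52),
    "redundant_op_table": (7.9, 34),
    "bus": (2.3, 14),
}

area_total = total_reduction(area_blocks)
power_total = total_reduction(power_blocks)
```

Rounded to one decimal place, `area_total` and `power_total` match the 18.2% and 26.5% totals of Tables 3 and 4.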

Area Cost and Power Consumption Breakdowns: 

Area Cost Breakdown: The shares of area cost of the value predictor and
redundant operation table change as shown in Figure 6(a). As shown in the figure,
the combined share of area cost of the value predictor and redundant operation
table is reduced from 42% to 29%.

Fig. 6. Area Cost and Power Consumption Breakdowns of Area Cost Reduced Alpha
21264 Model



Power Consumption Breakdown: The shares of power consumption of the value
predictor, redundant operation table, and asynchronous dual-rail bus
change as shown in Figure 6(b). As shown in the figure, the combined share of
power consumption of the value predictor, redundant operation
table, and asynchronous dual-rail bus is reduced from 55% to 38%.



7 Conclusion 

In this research, we have argued that low-area and low-power design methods
are needed for the design techniques used in high performance
microprocessors. Among many such techniques, three high performance design
techniques have been investigated.

Analysis of information locality and the related data redundancy shows that
area and power are wasted by data redundancy in each high performance
design method. Therefore, we exploited information locality to
minimize data redundancy in each method, and proposed a different approach
for each of the three methods.

First, to reduce the waste of area cost and power consumption in a value
predictor, which is caused by data redundancy in the tag and data parts, we
proposed combining the previously proposed partial tag and narrow-width methods
with an efficient table-update method. Structural and dynamic analyses show
that the proposed method reduces the area cost by about 71% and the power
consumption by about 61% over the conventional value predictor. Second, for
the redundant operation table, we designed a partial tag method. Although the
redundant operation table wastes area and power in both the tag and data parts,
only a redundancy minimization method for the tag part has been discussed. The
proposed method reduces the area cost by about 20% and the power consumption
by about 34% over the ordinary redundant operation table structure. Third, to
reduce the waste of power consumption of the asynchronous dual-rail bus, we
utilize the frequent value cache with several circuits. Analysis results show
that the proposed method decreases the power consumption of a memory bus in a
microprocessor by about 14% for integer and 22% for floating-point data
communications.

We also examined how much total area cost and power consumption
can be saved when the proposed reduction methods are used for
each design method. This analysis confirmed that the total area cost and power
consumption would be reduced by about 18.2% and 26.5%, respectively.



Abstract. In the context of general-purpose processing, an increasing
number of diverse functional units are added to cover a wide spectrum of
applications. However, it is still possible to design custom logic adapted
to a particular application that will perform far better than a processor.
Adding some reconfigurability can give a processor part of this adaptability
and help improve performance. We propose to extend the possibilities of
complex multifunction units by dynamically reallocating existing complex
functional units as multiple simpler units. The fact that more than
one simple unit is involved in the “reconfiguration” process implies that
the decision is more global and needs to be taken for a longer period
of time. We show that in typical superscalar architectures, there are no
major impediments to implementing such a decision scheme, and that
on a specific reallocation opportunity we can achieve speedups of up to
56% over a mainstream superscalar processor and practically no losses.



1 Introduction 

In general purpose processors, the quest for ever higher performance leads to 
many trade-offs, since one aims to achieve the best average performance on a 
variety of tasks essentially unknown to the designer. Many methods to extract 
even more parallelism, such as speculative execution or Very Long Instruction 
Word (VLIW) compiler technologies are complex and achieve diminishing re- 
turns, since the resources available to the processor are fixed. Attempts to make 
the processor adaptable to the program it is currently executing, through the 
use of reconfigurable logic, have provided mixed results. We propose to intro- 
duce some adaptability without using slow reconfigurable logic. To this end, we 
focus on the large multi-function units present in a superscalar processor. As
an example, we present a modification of a superscalar processor’s floating point
functional units (FPUs) that allows some adaptation to the current workload.

Section 2 lays out the constraints of the field and existing methods to
achieve high performance. Section 3 presents our proposal and its impact
on processor design. Our test methodology and reference processors are
described in section 4, with simulation results shown in section 5. Section 6
gives our conclusions, the limitations of our approach, and our future directions
of study.



P.-C. Yew and J. Xue (Eds.): ACSAC 2004, LNCS 3189, pp. 185-198, 2004.

2 Background and Prior Art

2.1 Parallelism 

The way to higher performance in general purpose processors is through forms 
of parallelism, especially by trying to execute as many instructions as possible 
at the same time. The theoretical limits in the parallelism offered by different 
programs are far higher than those achieved in reality by current processors, 
for a variety of reasons. In any case, the available hardware resources are fixed 
by the processor’s designer, and cannot be tailored to a particular application. 
They are chosen to get the best average performance. Superscalar processors, 
executing many instructions out of order every cycle, extract as much parallelism 
as possible during the execution of a program. This leads to complex designs, 
but these optimizations don’t require changes to the software. In an attempt 
to soften the restrictions of fixed hardware resources, configurable hardware has 
been examined. 

2.2 Reconfigurable Functional Units 

Given the limitations of a fixed set of hardware resources, much research has 
focused on adding some reconfigurability to a general-purpose system, usually 
based on FPGA technology. FPGAs are most efficient for code with simple 
control and large data parallelism (e.g., [2]). 

One can distinguish three different approaches, each bringing closer integra- 
tion with the processor, and thus more generality, at the expense of performance. 
The first and second couple an FPGA and a normal processor, and distribute 
the computing tasks according to what each can do best, the difference being 
whether to integrate the FPGA onto the processor chip or not. There is little 
automation possible, and selection and coding for the FPGA must be done by 
hand, including the needed communication and synchronization with the pro- 
cessor. With well chosen applications, the gains in performance can be of several 
orders of magnitude [15]. In a single-chip solution, some automation is possi- 
ble, usually with a smaller increase in performance than if optimizations are 
performed by hand (e.g., [13]). 

The last, most tightly coupled solution is to define the configurable logic 
as simply an extra functional unit (FU) of the processor. This reconfigurable 
functional unit can hold several instructions or sequences of instructions, that 
can be provided by a special compiler, and loaded by the processor when needed. 
Attempts to automate the process exist (e.g., [1], [16]), with gains similar to the 
second solution above (e.g., [5]). In each of these cases, the approach is to couple 
an existing FPGA-style block with a processor, in a more or less tightly coupled 
way. 

We propose to consider configuration possibilities as an issue in the design
of the processor’s functional units, instead of adding a block of existing fully
reconfigurable logic (such as FPGA technology) and trying to have the two
cooperate. This implies a reduction in the configurability available, albeit with
a significant gain in speed, which we hope to leverage.

2.3 Binary Compatibility

The issue of binary compatibility, ensuring that all code written for previous 
versions of a processor family will work on the newest model, is a complex one. 
However, it limits the innovation that can be implemented in the processor, 
since no completely novel approach may be used. As a solution to this problem, 
dynamic binary translation has been proposed. It aims to transform code for 
one architecture into another in real-time during the execution of the program. 
Several research projects exist [14], with one commercial implementation [8]. 

Our aim is to increase performance while avoiding code changes or having a 
major impact on timing. The lack of code changes allows our improvements to 
apply to all existing code and the re-use of all compiler achievements. Preserving 
the general timing will avoid breaking or severely limiting the performance of 
existing programs not suited to our modifications. 




Fig. 1. Paths between reservation stations and functional units (top), and realloca- 
tion possibilities (bottom). Each FPU can be reallocated as a number of extra ALUs 
(xALUs). FPU operations have 5 stages, thus the FPU must be idle for 5 cycles before 
reallocation is possible. Likewise, the xALUs have 2 stages, and must all be idle for 2 
cycles at the same time to allow reallocation. 



3 Proposed Modification 

Studies on the ideal mix and functionality of functional units in a superscalar
processor have been performed [7]. These studies show that good gains can be
obtained by increasing the number of identical functional units, as well as the
types of instructions these units can execute. We are interested in looking for
ways to reconfigure expensive functional units to perform different operations.
Given the speed disadvantage of fully programmable units, which are 5 to 10
times slower than dedicated custom logic in the same technology, we restrict
ourselves to very limited changes, while maintaining speeds close to
non-configurable logic.

Fig. 2. Left: Structure of a floating point multiply/divide unit, with assumed cycle
counts. Center and Right: Example of a 64 bit multiplier partial product reduction tree.
Center: Original Wallace tree structure (total delay 14τ). Right: Proposed modification
(total delay 15τ). CSAs have a delay of 1τ, CPAs have a delay of about 5τ. For clarity,
the multiplexers from the Register file to the CPAs for the xALU configuration are not
shown.



3.1 Basic Concept 

Multifunction units, such as the FPUs in the Intel Itanium 2 processor, can
execute one of many different instructions each cycle. As shown in figure 1,
we propose to reallocate an FPU, with a latency of 5 (figure 1a), as several
extra ALUs (xALUs) with a latency of 2 (figure 1b). These extra ALUs are
assumed to perform all the operations normal Arithmetic Functional Units do.
Our approach differs from multifunction units since, due to these latencies, the
reconfiguration decision cannot be taken on a cycle-by-cycle basis, but with a
view to the next several dozen cycles. This longer view is necessary to offset the
idle time before reallocation, as we have to wait for the entire functional unit to
be idle before reallocating it. We trade a small decrease in speed to obtain some
configurability, with the hope that adapting to applications will offset the slightly
slower configurable functional units to offer a net gain in performance. We focus
on a processor’s floating point unit, since it is fairly large, and can often be
idle during a program’s execution, if the current application uses mostly integer
code. Simply adding extra ALUs would further increase power consumption and
area, with little impact on the results (section 5).

3.2 Standard Arithmetic Units

Current fast multipliers for fixed point numbers can be built from a tree of
Carry-Save Adders (CSA) that adds all the partial products into two words, with a
final Carry-Propagate Adder (CPA) for the last addition [10]. The exact structure of
the tree may vary to achieve better regularity, essential for good integration. A
division unit can have a similar structure, if a convergence algorithm is used.
This would lead to the common implementation of a Mul/Div unit [11], with
the tree structure qualitatively as in figure 2 (center). Each CSA or CPA block
might need an inverter to allow subtraction. The CSA tree has [log_{3/2}(64)] = 9
levels.

A floating-point Mul/Div unit is essentially a fixed-point Mul/Div unit with 
some extra logic to unpack the operands, perform Booth recoding if it is used, 
normalize the result and re-pack it into floating-point notation, as shown in figure 
2 (left). The presence of a full CPA adder allows the re-use of the unpack and 
pack logic to include all floating point operations in the unit. It is also possible 
to use the floating point unit for integer multiplication and division, as in the 
Intel Itanium 2 processor [9]. 

3.3 Dynamic Functional Units 

We propose to use the adders in an FPU as a number of xALUs, with
characteristics similar to normal ALUs. As a CSA cannot be used to perform a complete
addition, several CSAs in the tree could be replaced by CPA adders as in fig- 
ure 2 (right) with only a minimal impact on the overall critical path, area and 
power consumption. This figure shows the proposed modifications to the reduc- 
tion tree, which affect only the steering of the data, not the logic performed 
on it. The CPAs directly receive some of the partial products while the other 
partial products go through the CSA tree to allow time for the far slower CPAs 
to finish execution, resulting in only a small extra delay due to the unbalancing 
of the tree. This requires some extra logic: to handle logic operations other than 
add/subtract, to bring the operands for the extra instructions that will be exe- 
cuted, to bypass the floating point logic, and to switch between the two different 
modes of execution. 



3.4 Effects on Functional Unit Latencies 

Our reference for instruction latencies is the Intel Itanium 2 processor, one of the
fastest (and certainly the largest) existing processors [17]. This processor has a
latency of 1 for all ALU operations, and a latency of 4 for all FP operations and 
Integer Mul/Div operations. These latencies are considered here representative 
of current 64-bit processors, and the functional units are fully pipelined. 

In deep sub-micron technology, such as 0.13 µm, wires account for about
2/3 of the delays, and the differences between 0.13 µm and 0.09 µm are not so
important in this regard. The increase in wiring to reach the xALUs is estimated
at about double that needed for normal ALUs. Thus, if a normal ALU has a
latency of 1 cycle, split as 1/3 gates and 2/3 wires, doubling the wires gives an
xALU latency of 5/3. Taking the multiplexers to select the adders in the FPU
into account, a conservative estimate for the latency of all extra ALU units is to
double the latency of normal ALUs, for a latency of 2. As confirming this timing
would require designing the entire functional core of a superscalar processor, a
complex task beyond our means, simulations with a very conservative latency
of 3, where about 89% of the delay is in the wires, have also been performed.
Additionally, some of the bypass paths necessary to keep the pipeline as full as
possible, and counted in the above calculations, are likely to already be present
in the multiplier’s tree linking the xALUs together. This also means that the
overhead is less than that of simply adding extra ALUs to the processor.
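
The latency estimate above is simple fraction arithmetic, reproduced here as a sketch. The function names are hypothetical; the 1/3 gate and 2/3 wire split and the doubling of the wiring are taken from the text.

```python
from fractions import Fraction
import math


def xalu_latency(alu_latency=Fraction(1),
                 wire_share=Fraction(2, 3),
                 wire_factor=2) -> Fraction:
    """Latency of an xALU when only the wire part of the delay grows."""
    gates = alu_latency * (1 - wire_share)
    wires = alu_latency * wire_share * wire_factor
    return gates + wires


def wire_fraction(total_latency,
                  alu_latency=Fraction(1),
                  wire_share=Fraction(2, 3)) -> Fraction:
    """Share of a given total latency that is left for the wires."""
    gates = alu_latency * (1 - wire_share)
    return (total_latency - gates) / total_latency


estimate = xalu_latency()             # 1/3 + 2 * 2/3 = 5/3 cycles
rounded_up = math.ceil(estimate)      # integral latency used in the text: 2
wires_at_3 = wire_fraction(3)         # 8/9, i.e. about 89% of the delay
```

Rounding 5/3 up gives the latency of 2 used in the main experiments, and with the very conservative latency of 3 the wires account for 8/9 of the delay, the 89% quoted in the text.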

The latency of the entire FPU being 4, we consider that the unpack stage
takes one cycle, the multiplier tree takes 2 cycles, and the normalization and
pack take the last cycle (figure 2 left). The replacement of some of the CSA
adders by CPA adders will increase the total delay of the multiplier tree. From
[10], considering that a CSA has a delay of 1τ, a delay of 5τ for a 64-bit CPA
can be derived. The total delay for a 64-bit CSA tree with 9 levels (see section
3.2) and the final CPA is thus 9τ + 5τ = 14τ (figure 2 center). We assume
this delay represents 2 cycles (figure 2 left), as both real processor data [9] and
arithmetic considerations [11] suggest. As shown in figure 2 (right), implementing
our modifications on the FPU to embed 3 CPAs in the compressor tree would
increase its delay to 8τ + 2τ + 5τ = 15τ, plus the delay of a multiplexer in front
of each CPA. To be on the conservative side, and since functional
unit latencies must be integral, we have assumed the total delay of the modified
tree to be 21τ, equivalent to 3 cycles, an increase of 50%, or one cycle, for a
total latency of 5 cycles in the functional unit. This adds a margin of 6τ, almost
40%, that is, many layers of logic, to the timing of the FPU’s CSA tree. In any
case, the partial products reduction tree is a logarithmic tree which can be easily
unbalanced as needed to hide the delay of the CPAs, and so the inaccuracy due
to the delays of the multiplexers and the bypass paths should not be significant
overall.
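
The delay budget of the reduction tree can be checked with a short calculation. The variable names are hypothetical, delays are expressed in units of τ, and the 2τ middle term of the modified tree is copied from the text without further interpretation.

```python
CSA_DELAY = 1   # delay of one carry-save adder level, in tau
CPA_DELAY = 5   # delay of a 64-bit carry-propagate adder, in tau

# Original Wallace tree: 9 CSA levels followed by the final CPA.
original_delay = 9 * CSA_DELAY + CPA_DELAY

# Modified tree as stated in the text: 8 tau + 2 tau + final CPA.
modified_delay = 8 * CSA_DELAY + 2 + CPA_DELAY

# The original tree is assumed to fit in 2 cycles.
tau_per_cycle = original_delay / 2

# Conservative assumption: the modified tree is given 3 full cycles.
assumed_delay = 3 * tau_per_cycle
margin = assumed_delay - modified_delay
margin_ratio = margin / modified_delay

# Unpack (1) + tree (3) + normalize/pack (1) cycles.
fpu_latency = 1 + 3 + 1
```

The margin evaluates to 6τ on top of a 15τ tree, a ratio of 0.4, which leaves room for the multiplexers in front of the CPAs that the 15τ figure does not include.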

Since the reconfiguration is achieved by switching the inputs of a few multi- 
plexers, it takes only a single cycle, in addition to having to wait for the func- 
tional units to be idle, with no changes to the pipeline except the activation of 
the forwarding paths discussed above. The routing of the processor core must 
be redone to take the new data paths into account, but this kind of work must 
be done for newer technologies in any case. These numbers are summarized in 
table 1. 



3.5 Switch Decision Mechanism 

Given the possibility of changing an FPU into a number of xALUs, the question
arises of when to perform this change, and when to change back. Since
this decision cannot be taken every cycle because it is a global decision affecting
several functional units (figure 1), an algorithm to adapt the resources to the
code running at a given moment is needed. The basis for the decision is the type
of instructions in the reservation stations. This gives a measure of the type of
instructions the processor can expect to be executing a few cycles later. In the
simplest case, the number of instructions of each type is compared to the
number of available functional units of the same type to make a decision. A switch
is decided when the difference between the proportion of instructions of a type in
the reservation stations and the resources of that type becomes too large. In the
relatively common case where an instruction type appears very infrequently,
such as an integer program with very few multiplications, the algorithm above
will not trigger a switch, since the threshold is not reached by a single instruction.
In this case, we must detect that an instruction cannot be executed due to
the absence of the correct resource type, and force a switch, regardless of the
contents of the reservation stations. In all cases, a switch decision must wait
until the functional unit(s) it wants to reallocate are completely idle, at which
point the switch takes only a single cycle. It would be possible to switch while
the FPU is still finishing the last calculation, during the normalization/pack
stage, but the pipelining of the switch logic and the extra complexity in the
pipeline would greatly increase the complexity of the control path without a
great effect on performance.
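
The decision rule described above can be sketched as follows. This is a minimal illustration, not the simulator code: the threshold value, the use of ALU share as the compared proportion, the one-FPU-at-a-time step, and all names are assumptions made for the example. The unit counts are those of the dynamic mainstream model in table 1.

```python
N_ALUS = 3           # fixed ALUs (dynamic mainstream model, table 1)
N_FPUS = 2           # reconfigurable FPUs
XALUS_PER_FPU = 4    # xALUs obtained from one FPU
THRESHOLD = 0.1      # hypothetical imbalance threshold, not given in the text


def decide(n_alu_instr, n_fp_instr, fpus_as_xalus, units_idle, threshold=THRESHOLD):
    """Return the number of FPUs that should be configured as xALUs.

    n_alu_instr, n_fp_instr: instruction counts in the reservation stations.
    fpus_as_xalus: current number of FPUs operating as xALUs.
    units_idle: True if the unit(s) to be reallocated are completely idle.
    """
    total = n_alu_instr + n_fp_instr
    if total == 0:
        return fpus_as_xalus
    wanted = fpus_as_xalus
    if n_fp_instr > 0 and fpus_as_xalus == N_FPUS:
        # Forced switch: an FP instruction is waiting but no FPU exists.
        wanted = fpus_as_xalus - 1
    else:
        alu_units = N_ALUS + XALUS_PER_FPU * fpus_as_xalus
        fp_units = N_FPUS - fpus_as_xalus
        instr_share = n_alu_instr / total
        unit_share = alu_units / (alu_units + fp_units)
        diff = instr_share - unit_share
        if diff > threshold and fpus_as_xalus < N_FPUS:
            wanted = fpus_as_xalus + 1      # too few ALU resources
        elif diff < -threshold and fpus_as_xalus > 0:
            wanted = fpus_as_xalus - 1      # too few FP resources
    # A switch is only carried out once the affected units are idle.
    return wanted if units_idle else fpus_as_xalus


if __name__ == "__main__":
    print(decide(20, 0, 0, True))   # integer-only code: convert one FPU
    print(decide(19, 1, 2, True))   # lone FP instruction forces a switch back
```

With these assumed values, a purely integer instruction mix converts both FPUs over two successive decisions, and a single FP instruction arriving while no FPU is available triggers the forced switch even though the proportional test alone would not.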



3.6 Additional Considerations 

The act of switching one or more FPUs into a number of xALUs increases the
pressure on the memory system, and creates a need for extra issue,
dispatch and commit width. Though the memory bandwidth remains the same,
a higher number of Load/Store units is required to avoid stalling the processor
due to many memory requests. In our simulations, 4 such units (as in the Itanium
2) were a good balance between performance and complexity. The widest issue
rate in current processors is 8 instructions per cycle [17]. A larger issue rate
increased the gains of dynamic reconfiguration, but only slightly. Thus, the issue
and dispatch widths were kept at 8. The commit width need not be as large as
the issue/dispatch width, since the average number of instructions committed
per cycle is lower than the maximum. In our simulations, the highest average IPC
was slightly below 4 (vortex), leading to a commit width of 8 to avoid limiting
performance, as the simulator used requires it to be a power of 2, although a
value of 4 could be considered.



4 Experimental Methodology 

All the results presented in section 5 were obtained through the use of the
Simplescalar tool set [3]. The models used for the hardware are detailed in section
4.2. On the software side, the SPEC CPU2000 [6] benchmarks were used for all
tests.

Table 1. Processor model resources. The baseline mainstream and baseline top
processors were compared to their dynamic counterparts in all simulations, with
original mainstream and original top shown as references. Supertop is equivalent
to dynamic top with 4 additional ALUs and no reconfiguration.

Model                 #ALUs      #FPUs      #Load/Store  #xALUs per FPU  issue-dispatch-
                      (latency)  (latency)  units        (latency)       commit widths

original mainstream   3 (1)      2 (4)      2            -               4-4-4
original top          6 (1)      2 (4)      4            -               8-8-8
baseline mainstream   3 (1)      2 (4)      4            -               8-8-8
baseline top          6 (1)      2 (4)      5            -               12-12-8
dynamic mainstream    3 (1)      2 (5)      4            4 (2)           8-8-8
dynamic top           6 (1)      2 (5)      5            4 (2)           12-12-8
supertop              10 (1)     2 (5)      5            -               12-12-8







Fig. 3. Simulation results for the SPEC benchmarks for the baseline mainstream (light) 
and dynamic mainstream (dark) processors. There are large variations in the overall 
IPC, with some significant gains by the dynamic model. 

Fig. 4. Speedups between the baseline mainstream and the dynamic mainstream
models. The integer benchmarks show universal gains, whereas the FP benchmark
results are more varied. Except for sixtrack, all negative speedups are very small,
less than 1% slower than the baseline.



4.1 Modifications to Simplescalar 

The most accurate simulator in the Simplescalar tool set, sim-outorder, was 
modified so that a number of FPUs can be turned into several xALUs. The 
switch decision algorithm was also added to the simulator’s main loop, to choose 
whether and how to change the allocation of resources during program execution. 

4.2 Reference Processors and Models 

Two different references, loosely inspired by mainstream and top server
processors available today, and considered representative of the state of the art in
general-purpose processors, were used:

Our mainstream reference is similar to the IBM Power4 processor (a single
core), and is close to the average resource configuration of current processors.
Each core is a 4-way superscalar processor, and has 2 ALUs, 2 load/store units,
one branch unit and 2 FPUs.

Our top reference is loosely based on the Intel Itanium 2 processor, one of the
fastest server processors available today, as measured by SPEC benchmarks. It
has 2 ALUs, 4 load/store units that can also perform ALU operations, 3 branch
units, and 2 floating-point units that also take care of integer multiplication.
Although it is a VLIW processor, its resources represent well the most aggressive
configuration achievable nowadays.

For a fair comparison, both reference models are given the same memory
access bandwidth and ports as our proposed model (4 or 5 load/store units and
a 128-bit wide access to memory), as well as the same issue/dispatch/commit
widths, giving us our baseline mainstream and baseline top models. Although
these models are somewhat unbalanced, not increasing the number of load/store
units would cripple the dynamic models, which are obtained by increasing the
FPU latency as explained in section 3.4 and adding dynamic reallocation.
Supertop is defined as a fully static top, with 4 additional ALUs and no
reconfiguration, and is used to show the small difference in performance compared
to the dynamic top. These characteristics are summarized in table 1.



4.3 SPEC CPU 2000 Benchmarks 

All our tests considered the entire set of 26 benchmarks comprising the SPEC
CPU2000 suite. The binaries are provided for the DEC Alpha [4] Instruction Set
Architecture (ISA) on the Simplescalar WWW site [3], and have been compiled
using the ’peak’ configuration. The data sets chosen are the reference sets from
the SPEC suite. Given the length of the full simulations, early Simpoints [12]
were used to provide statistically significant results for the mainstream model,
detailed in figures 3 and 4. Due to time constraints, and since they are only
intended to show the limits of reallocation, the top and supertop models were
simulated skipping a smaller number of instructions than Simpoint suggests.
Although the individual results may vary, the average over the 26 benchmarks
is similar to that obtained using early Simpoints, and sufficient to show a trend
of diminishing returns.



5 Results 

5.1 Performance Results 

Figure 3 shows the results of our simulations for the mainstream model, using
the configurations in table 1, lines 3 and 5. The speedups when using perfect
memories, not shown, differ little from those presented here, demonstrating
that reasonable memory latencies have little effect on the gains made
by dynamic reallocation. The best performing benchmark was vortex, with a
gain of 56%, since it uses many independent ALU operations and very few FP
instructions, and can thus make good use of the xALUs. The worst was
sixtrack, with a loss of 3.8%: it is mostly composed of FP add and multiply,
and is thus strongly affected by the increase in FPU latency. The average gain for
the integer benchmarks was 19%, and 3.5% for the floating-point benchmarks.
The overall average for the entire suite was a gain of a little more than 10%. For
clarity, the corresponding speedups for the entire set of benchmarks are shown
in figure 4. There is a systematic gain, only seldom insignificant, and the rare
losses in heavily FP-oriented benchmarks are rather small, with the exception
of sixtrack.

Fig. 5. Left: Structural stalls for mcf (top) and wupwise (bottom). The left side is the
mainstream baseline case, the right side is with dynamic reallocation. Mcf is limited
by ALU instructions, and shows a large reduction in ALU stalls. Wupwise sees little
change in stalls, and thus cannot benefit from reallocation. Right: Instruction types
for galgel. As there is no region with few FP instructions and many ALU requests, the
allocation decision is to have no xALUs, resulting in lower performance.







Fig. 6. Instruction types (top) and resource allocation (bottom) for mcf (left) and 
sixtrack (right). For mcf, as there are almost no FPU instructions, the configuration 
is always to use 8 xALUs. When an FPU instruction arrives, the FPU is switched to 
execute it, and then immediately switches back. In the case of sixtrack, the alloca- 
tion of the FPU’s resources adapts to the instruction types: when there are few FPU 
instructions, the units will be reallocated as xALUs. 

The results for the top model, described by lines 4 and 6 in table 1, show a
reduction in the gains obtained, due to far less usage of the xALUs, as there are
already 6 ALUs in the processor. Again, memory latency did not significantly
affect the speedups. The average gains were 3.7% for integer benchmarks, and
1.5% for floating-point, giving a total average gain of 2.5%. For comparison, the
Supertop model gives an average gain of 3.1% versus the baseline top, at the cost
of a larger set of functional units and resources on the die. If the xALU latency
is increased to 3, the results for the mainstream model show a reduction in the
average gain from 10% to 7%, and in the maximum gain from 56% to 35%. Thus,
although this delay is somewhat critical to our gain, the benefit of our system
does not fully rely on these timing assumptions. Losses are not affected, since
these benchmarks rarely use the xALUs, if ever.
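
As a consistency check on the reported averages, the suite-wide figures can be recomputed from the per-group ones. The sketch below assumes that every average is a plain arithmetic mean over individual benchmarks (the text does not state which mean was used) and uses the composition of SPEC CPU2000, 12 integer and 14 floating-point benchmarks; the function name is illustrative.

```python
N_INT = 12   # integer benchmarks in SPEC CPU2000 (CINT2000)
N_FP = 14    # floating-point benchmarks in SPEC CPU2000 (CFP2000)


def suite_average(int_gain, fp_gain):
    """Benchmark-count-weighted mean of the two group averages (in percent)."""
    return (N_INT * int_gain + N_FP * fp_gain) / (N_INT + N_FP)


mainstream = suite_average(19.0, 3.5)   # group averages reported for mainstream
top = suite_average(3.7, 1.5)           # group averages reported for top

print(f"mainstream: {mainstream:.2f}%")
print(f"top: {top:.2f}%")
```

Under that assumption the recomputed values agree with the "little more than 10%" and 2.5% quoted for the mainstream and top models respectively.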

5.2 Influence of Instruction Types 

The large differences in speedups for the different benchmarks can be explained
by looking at the instruction types used in these benchmarks. We shall use
three benchmarks to illustrate this point: mcf, wupwise and galgel. The
following graphs show good examples of the different behaviors reallocation
produces. However, these are not necessarily representative of the overall
benchmark results. Figure 5 (left) shows the number of structural stalls, i.e., the
number of instructions of each type which had all operands ready but could not
execute due to the lack of a functional unit, for the first two benchmarks with the
mainstream model. The former, mcf, is limited here almost only by ALU instructions
in addition to memory accesses, and thus benefits greatly from our proposal, since
both FPUs get reallocated into many xALUs, switching back regularly to service
the FP operations. This behavior is shown in figure 6 (left). The limitation by
the Load/Store units appears because all ALU instructions that were previously
waiting for a functional unit have been executed by one of the xALUs, and the
memory accesses that had time to execute in the baseline case now stall the
processor while waiting for the Load/Store units, which are now far less
numerous than the ALUs. On the other hand, wupwise uses a fairly diverse mix
of instruction types, with a heavy emphasis on floating-point add and
multiply/divide instructions. The switching mechanism is constantly reallocating the
functional units to try to match the instruction mix at each moment in time. In
this case, the extra ALUs available at some moments cannot compensate for the
slowdown of the FPUs’ mul/div units and the delays in switching between the
two. As a further illustration, a short trace of the instruction types for galgel is
shown in figure 5 (right). The corresponding switch decision, not shown, is to never
use the xALUs, leading to a loss in performance due to a longer latency in the FPU.

5.3 Switching Dynamics 

For the resource reallocation to work, the switch mechanism must configure the
hardware to make the best use of the configurable resources. Figure 6 (right)
shows a short trace from the sixtrack benchmark, taken after approximately
10^9 instructions. Figure 6 (top right) displays the number of instructions
committed from the ALUs and the FPUs, while figure 6 (bottom right) shows the
configuration of the FPU over the same period of time.

The pattern shown is one of the startup loops in the application, and repeats
regularly around the instruction count shown. At around 200 cycles, there are
more FPU instructions than ALU ones, and the switching mechanism does not
allocate any xALUs. However, at 300 cycles, the situation reverses, and one
FPU is converted into 4 xALUs. A sharp spike in ALU instructions coupled
with a sharp drop in FP instructions at 450 cycles causes both FPUs to be
reallocated as 8 xALUs for a brief moment, before resuming FP functions. A long
period of relative stability, between 650 and 850 cycles, leads to an unchanging
configuration.

6 Conclusions and Future Work 

We have proposed a method to give the hardware some adaptability to the code
running on a general-purpose processor without sacrificing the speed of the
configurable unit or compromising binary compatibility. This technique is
distinctive in requiring the logic of the superscalar processor to make more global
decisions than it normally does. The conditions for the simulations have been
derived from real data measured from 0.13 µm technology. The results show that
a dynamic FPU is quite attractive for processors with a modest
number of ALUs, and that, naturally, the benefit declines when the processor
already has a large number of ALUs. Our idea, based on giving the processor more
possibilities for parallelism, should be seen as an example of the possibilities
in superscalar processors that can be exploited by multi-cycle reallocation
decisions. When superscalar processors enter the embedded System-on-Chip
world, the common use of domain-specific instructions or coprocessors for these
applications will increase the opportunities for similar forms of reconfiguration.

We intend to apply control theory to the decision mechanism, in order to
better tailor the resources to the application. Simulations on an SMT processor
are expected to produce interesting results, due to the extra parallelism exposed
by the multiple threads. We also plan to investigate the possibility of using
software hints in the code to guide resource reallocation. While this would maintain
backward binary compatibility, it would require recompilation and some analysis
of the code to produce better gains. In a similar vein, it might also be possible
to apply this method to VLIW processors, in which case the resource allocation
would simply be another piece of information generated by the compiler.



Acknowledgment. We would like to thank the anonymous reviewers for their 

Abstract. Intensive processing applications, such as scientific computation,
signal processing, and graphics rendering, motivate new processor architectures
that place new burdens on the designer. These applications, called stream
applications, demand very high arithmetic rates and data bandwidth, but lack
data reuse. At present, modern VLSI technology makes arithmetic units
relatively cheap. MASA (Multiple-dimension Scalable Adaptive Stream
Architecture), presented in this paper, is a prototype that operates on streams
directly. It differs from DSPs and special-purpose high-performance single-chip
architectures in that it combines flexibility and high performance. It has the basic
features of all stream processors: it provides a bandwidth hierarchy, keeps the ALU
array fully loaded, and decomposes an application into a set of
computation modules that execute with space multiplexing or time multiplexing. The
multiple-dimension scalability of MASA, which covers the task level, loop level,
instruction level and data level, enables it to meet the demands of stream
applications. This paper describes the MASA architecture and stream model in the
first half, and explores the features and advantages of MASA through mapping
stream applications to hardware in the second half.



1 Overview 

Under the power of Moore’s law, the number of transistors integrated on chips has
been increasing rapidly and the performance of chips has been enhanced constantly.
In a contemporary 0.1 µm CMOS technology, a 32-bit integer adder requires less
than 0.05 mm^2 of chip area. Integrating more ALUs on a single chip is no longer a
problem. On the other hand, this situation brings some new problems to the computer
designer. One is how to supply so many ALUs with enough instructions and data, as
long wire delays are becoming more and more unavoidable owing to the increasing
on-chip density. The other is how to make full use of the chip’s integration
capacity, that is, how to devote more of the chip area to computation than
traditional general-purpose architectures do. For example, only 6.5% of the
Itanium 2 die is devoted to arithmetic units [1], and a large fraction of its die
area is consumed by other functions such as data cache, branch prediction,
out-of-order execution and communication scheduling. In addition, this low
efficiency deserves attention as the demands of applications increase.

Under these conditions, traditional microarchitecture faces a challenge: how can
the performance of microprocessors keep scaling at a rate of 50% or more
per year [2]? Much research on new microarchitectures for chips with a billion or
more transistors is under way. Several technologies have emerged over time, such as
systolic arrays in the 1970s, data flow in the 1980s and vectors in the 1990s.

Stream architecture operates on data streams. It was first applied in media processing
because of its intensive computation, high data parallelism, and little data reuse. A
stream may have various formats, of fixed or variable length, and may be a collection
of complex or simple elements. The notion of a data stream is nevertheless consistent
with the colloquial sense. According to Webster, a stream is “an unbroken flow (as of
gas or particles of matter)”, a “steady succession (as of words or events)”, a
“constantly renewed supply,” or “a continuous moving procession (a stream of
traffic)” [3]. The difference between stream and data flow is that some stream
architectures are instruction driven rather than data driven. Though there are
several kinds of stream architectures, their common feature is to take streams as
architectural primitives in hardware. Stream architecture is suitable for VLSI
technology and supports enough functional units to achieve the needed arithmetic
rates because, among other things, it has a scalable register file architecture. At
the same time its organization is quite simple, so it is likely to be a hotspot in
the near future.

MASA (Multiple-dimension Scalable Adaptive Stream Architecture) is a kind of
stream architecture. The definition of programming is expanded, meaning that almost
all parts are programmable, including the inter-ALU switch, the communication
between arithmetic pages and the register organization. It is therefore far more
flexible than special-purpose architectures. It shares the common characteristics of
all stream processors: it provides a multiple-level bandwidth hierarchy and keeps the
ALU arrays fully loaded. According to the demands of various applications, MASA
decomposes an application into modules that are space-multiplexed or
time-multiplexed, and it can be scaled in multiple dimensions (including task level,
loop level, instruction level and data level).

The remainder of this paper is organized as follows. Section 2 (“related work”)
cites prior work in streaming and architecture that has inspired MASA. Section
3 (“MASA microarchitecture”) presents MASA. Section 4 (“the MASA’s stream
model”) discusses the execution model of streams in MASA. Section 5 (“stream
application studies”) analyses the computing process by mapping a typical
application onto the stream execution model of MASA and, by comparison with the
scalar and vector processing models, explores the features and advantages of MASA.
The last section (Section 6) summarizes the conclusions drawn in this paper.



2 Related Work 

MASA draws heavily on the prior work of numerous parallel models and 
architectures. This section highlights only a few of those works. 

Viram [4,5] adopts a multiple-level vector pipeline and places a large memory on
the chip. It integrates typical vector processing with PIM technology. However, it
has limited scalability.

Raw [6], a typical representative of tiled architectures, implements thread-level
parallelism and is reconfigurable. But the inter-tile connection is complex and the
cost of the crossbar is very significant.

Score [7,8] exploits thread-level parallelism like Raw, and has good compatibility
because it hides the size of the hardware from the programmer. Although compute
pages are reconfigurable, the waste of logic is quite large and the implementation
is hard.

Imagine [9,10,11], a pioneer of stream architecture, extends vector technology.
Its structure is very simple. It provides a bandwidth hierarchy and resolves the
bottleneck very well. It performs time multiplexing of multiple kernels on a single
chip and implements instruction-level and data-level parallelism, but its task-level
parallelism is weak. A supercomputer called Merrimac [19] has been developed on the
basis of Imagine. One Merrimac stream processor is very similar to Imagine. MASA is
heavily influenced by Imagine.

VLIW [12] exploits instruction-level parallelism well. A VLIW architecture is
power-efficient because the compiler, rather than hardware as in a superscalar
architecture, schedules the operations. But its scalability is very poor. Many
improved VLIW designs have since appeared, such as dynamic VLIW technology [13].

Trips [14,15] is a general-purpose chip targeting 2010 that consists of one or more
interconnected grid processors working in parallel. It is a polymorphous system,
meaning that the hardware should adapt itself to a variety of runtime workloads.
However, it is essentially a reconfigurable array, with new tasks assigned to the
compiler and OS, and it is quite complex.



3 MASA Microarchitecture 

The prototype microarchitecture of MASA is shown in Figure 1. MASA is a
programmable stream processor, which works as a stream coprocessor. Scalar
programs are executed on the host processor. MASA implements the logical stream
model in a single chip. The whole architecture can be divided into three layers by
bandwidth level, and each layer has its own controller, communication
style and storage unit.

In the outer layer, the scalar processor executes scalar instructions, such as MIPS
instructions, and transfers explicit stream instructions to the instruction caches in
each compute engine (CE), which is a set of units executing thread-level tasks
partitioned ahead of time. A compute engine can complete a task independently, so
MASA may contain only one compute engine, with further engines added only when more
task-level computing is required. The stream memory system transfers streams between
the stream register file (SRF) and off-chip SRAM or SDRAM. The MMU and the compute
engines are connected by multiple buses that we call the Multi Stream-bus. Engines
compete for stream-bus ownership through asynchronous request-answer signals. After
a request is granted, stream records are transferred to the SRF sequentially, in
groups or bursts. A stream record is the elementary unit of a data block. The Arbiter
is responsible for the grant and other signals, and the arbitration policy can be
either time-multiplexing or dynamic priority. The priority for each bus can be set
dynamically, meaning that it may be modified by the host processor. For each bus,
the Arbiter has a special register to indicate the priority. Because stream data is
continuous, the bus interface of each engine has many buffers, called stream
buffers, to hide the latency of memory access.
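
The two arbitration policies mentioned above can be illustrated with a small model of the arbiter of one stream bus. This is only a sketch under stated assumptions: time-multiplexing is modelled as round-robin, the priority register is modelled as one value per engine, ties go to the lowest engine number, and the class and method names are hypothetical.

```python
class BusArbiter:
    """Toy arbiter for one stream bus shared by several compute engines."""

    def __init__(self, n_engines, policy="time"):
        if policy not in ("time", "priority"):
            raise ValueError("policy must be 'time' or 'priority'")
        self.n_engines = n_engines
        self.policy = policy
        self.priority = [0] * n_engines   # priority register, written by the host
        self._next = 0                    # round-robin pointer for time-multiplexing

    def set_priority(self, engine, value):
        """Model the host processor updating the priority register."""
        self.priority[engine] = value

    def grant(self, requests):
        """Return the engine that wins the bus, or None if nobody requests it."""
        if not requests:
            return None
        if self.policy == "priority":
            # Highest priority wins; ties go to the lowest engine number.
            return max(sorted(requests), key=lambda e: self.priority[e])
        # Time-multiplexing: scan engines starting from the round-robin pointer.
        for offset in range(self.n_engines):
            engine = (self._next + offset) % self.n_engines
            if engine in requests:
                self._next = (engine + 1) % self.n_engines
                return engine
        return None


if __name__ == "__main__":
    rr = BusArbiter(3)
    print([rr.grant({0, 2}) for _ in range(3)])   # alternates between 0 and 2
    pr = BusArbiter(3, policy="priority")
    pr.set_priority(2, 5)
    print(pr.grant({0, 2}))                       # engine 2 has higher priority
```

Under time-multiplexing, engines that keep requesting the bus are served in turn, whereas under dynamic priority the host can favour one engine by writing its register.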




The next layer is the compute engine. Each engine consists of many arithmetic page
arrays (APAs), SRF banks and a control unit. An arithmetic page array, which contains
many arithmetic logic units, is the basic unit for executing a kernel microcode
program. The APA schedule unit determines which APA executes a given kernel, that
is, how to map logical kernels to physical execution units. A scoreboard in the
schedule unit records the status of each APA. According
to the information from the schedule unit, the stream controller dispatches stream
instructions to one or more APAs. The scalar processor controls the stream controller
through stream instructions. These instructions determine how kernels are processed
and which APA is assigned to which kernel. Kernels are space-multiplexed or
time-multiplexed, as discussed in the next section. Instructions are combined as
VLIW, and microcode consists of a series of 256-512 bit instruction words. When
microcode is fetched from memory, it is dynamically reorganized by an
instruction issue unit to adapt to the scalability. Since the instruction buffer of
an APA is limited and instructions are regularly reused, an instruction cache is
necessary, and its basic access unit is no longer an instruction but the whole
microcode of a kernel. If the instructions for the next execution were already
reorganized earlier, they are fetched from the instruction cache, so they need not be
reorganized or distributed again; otherwise the request is sent to off-chip memory.
All transfers of data pass through the SRF. Stream data exhibits locality, so the
SRF is divided into several banks, each tens of kilo-words in size, which lets MASA
effectively increase the SRF’s bandwidth. Data in a logical kernel may not be stored
across banks. According to the destination of the data, the SRF is partitioned into
main SRF (MSRF) banks and communication SRF (CSRF) banks. The former are responsible
for the exchange of data between memory and engine, or dataflow between kernels. The
latter are responsible for the communication of global data among compute engines.
MASA uses a tree-like topology to connect the SRF to the APAs, which matches the weak
dependence and unidirectional transfer of streams. At this layer, both instructions
and data are transferred over the same data path, but at different times, so the
execution of a program is divided into two parts: configuration time and execution
time. Partition of kernels and issue of instructions are completed in configuration
time. After that, the stream controller is responsible for transferring data and
executing the program according to the kernels partitioned beforehand. During this
period, the kernel program stays unchanged in the APA and data streams are
transferred from the SRF to the APA in sequence.

The innermost layer of MASA is the kernel layer. Microcode is executed in an APA in
SIMD fashion. Each APA consists of many arithmetic pages (APs) and one central
array controller (AC). Each AP contains eight to sixteen arithmetic logic units,
including floating-point adders, floating-point multipliers, dividers and some
special function units. Each ALU has its own local register file (LRF) to hold its
local data. An LRF is kept small enough to be accessed quickly. All ALUs in an AP
are connected by a programmable switch network, so ALUs in an AP can communicate
with each other quickly and flexibly. Communication between APs takes many
more cycles; fortunately, stream applications keep such communication to a
negligible level. The VLIW code of a kernel program is buffered in the array
controller. The AC has a dedicated instruction RAM that holds thousands of VLIW
instructions, and it is also responsible for decoding packed VLIW instructions into
control signals and transferring these signals to every AP in the APA.

The bandwidth hierarchy of the stream processor ensures that kernel programs do not
access memory directly. The memory access latency can be effectively hidden by
buffering, software pipelining and similar techniques. All inputs and outputs of
kernels have to pass through the SRF as streams. Similarly, temporary results and
local data that exist or are used only within a kernel are confined to the fast
LRFs, so the bandwidth requirement on the SRF is greatly reduced. Moreover, operand
fetch time is explicitly determined, so no miss can happen. As a result, at the
kernel level the execution time of every instruction is explicitly determined,
without any uncertain cycles, so the control unit can build an accurate static VLIW
schedule. For this static structure, VLIW can be most efficient because it
simplifies or eliminates the hardware for dynamic scheduling, such as register
renaming, dependence detection and out-of-order issue. In line with the features of
stream algorithms, MASA focuses on the problems of high bandwidth and multi-kernel
execution, so it can achieve high speed with a simple structure, which helps to
reduce area and improve power efficiency.

According to the bandwidth hierarchy, MASA defines different architectural layers.
These layers correspond to different kinds of parallelism, namely task-level
parallelism (TLP), instruction-level parallelism (ILP) and data-level parallelism
(DLP), which makes it easy to scale MASA along any of these dimensions.
MASA can therefore be scaled specifically in the dimension where it is most needed,
which makes its scalability more powerful and efficient. However, the
most challenging task in scaling is how to partition the basic APAs among different
kernels in a balanced way. MASA can solve this problem through explicit definition in
the program or through the cooperation of compiler and hardware (such as the APA
schedule unit). MASA is therefore highly adaptive.



4 The MASA’s Stream Model



The definition of stream in MASA is similar to that in other stream architectures. It 
consists of successive ordinal isomorphic elements. Stream has various formats, fixed 
or variable length, and complex or simple elements. 

MASA works as a coprocessor for a scalar processor. It receives stream data and 
stream instructions from the scalar processor and transfers the results back to this 
host. The scalar processor, for its part, executes the scalar programs. 

A stream application on MASA is divided into three levels: the stream-thread 
level, the stream scheduling level and the kernel execution level. An instance of the 
model is shown in Figure 2. 
Fig. 2. Hypothetical MASA model 



At the top level, the stream-thread level, programming is done in a thread C-like 
language. This level is responsible for memory access scheduling across multiple 
buses and for low-intensity communication among threads. Tasks are scheduled at the 
thread level. This level is optional, however, and is used only in large-scale 
computation (for instance, when there are two or more engines in MASA). 

At the second level, programming is done in an extended streamC [16] language. A 
stream task is decomposed into a series of computation kernels that process large 
amounts of stream data. The programmer can determine the execution order of the 
kernels and the size of the kernel pattern. The program at the stream scheduling 
level controls the whole flow of a stream task’s execution, including communicating 
with the scalar processor, loading streams from off-chip memory across the high-speed 
bus, storing streams to off-chip memory, and starting kernels. The route of streams 
among kernels resembles a dataflow graph. A kernel operates on successive data in its 
input stream and produces an output stream that serves as the input stream of other 
kernels. As on a production line, each kernel processes one record of the input 
stream at a time and then appends the result to the output stream. Kernels may be 
executed in two modes: time-multiplexing and 







space-multiplexing mode. In the former mode, kernels are executed one by one at 
different times, and each gains exclusive use of all the computing resources. In the 
latter mode, several kernels share the computing resources at the same time; we call 
the set of these kernels synchronous kernels. In space-multiplexing mode, an output 
record is sent directly to the next kernel, whereas in time-multiplexing mode it may 
be transferred through a buffer. 
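
The difference between the two modes can be illustrated with a minimal Python sketch. It is not MASA code: the function names, the representation of a kernel as a plain function on one record, and the way the intermediate buffer is counted are all assumptions made for illustration only.

```python
from typing import Callable, Iterable, List, Tuple

Record = float
Kernel = Callable[[Record], Record]  # hypothetical: a kernel maps one record to one record


def run_time_multiplexed(kernels: List[Kernel],
                         stream: Iterable[Record]) -> Tuple[List[Record], int]:
    """Run the kernels one by one; each processes the whole stream before the next starts.

    Returns the output stream and the largest number of records that had to be
    buffered between two consecutive kernels.
    """
    data = list(stream)
    peak_buffer = 0
    for index, kernel in enumerate(kernels):
        data = [kernel(record) for record in data]
        if index < len(kernels) - 1:  # the last kernel's output is the final result
            peak_buffer = max(peak_buffer, len(data))
    return data, peak_buffer


def run_space_multiplexed(kernels: List[Kernel],
                          stream: Iterable[Record]) -> Tuple[List[Record], int]:
    """Run the kernels as synchronous kernels; each record is passed on directly."""
    output = []
    for record in stream:
        for kernel in kernels:
            record = kernel(record)  # handed straight to the next kernel
        output.append(record)
    in_flight = 1 if output else 0  # at most one record between adjacent kernels
    return output, in_flight
```

Both functions produce the same output stream; under these assumptions they differ only in how many intermediate records must be held between kernels.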

Computation kernels lie at the third level, which describes the execution of a 
single kernel in a kernelC-like [16] language. The details of the arithmetic 
operations and the data execution mode are defined at this level. A kernel executed 
in an AP is composed of VLIW instructions in order to exploit the abundant 
instruction-level parallelism. Moreover, by requiring programmers to explicitly use 
the appropriate type of communication for each data element, this level expresses 
the application’s inherent locality. All temporary values are generated and exist 
only within a kernel, so the model does not use inter-kernel communication for 
temporary values. 

Before execution on the chip, the compiler and hardware decide how many APAs are 
available to execute several kernels, either simultaneously or by time-sharing, and 
how many APAs each kernel occupies. Between APAs that implement different kernels, 
data is transferred in the form of streams. When forming logical kernels, the 
programmer should consider the number of arithmetic operations in each kernel and 
the size of the intermediate results. An ideal kernel performs a large number of 
operations on local data but passes only a small set of intermediate results to 
other kernels. At the same time, the balance between producer kernels and consumer 
kernels is important. Kernel fission and fusion techniques can be introduced when 
decomposing kernels. 
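
One way to reason about producer/consumer balance is sketched below in Python. This is an illustrative heuristic and not the MASA compiler or APA schedule unit: the function `allocate_apas` is hypothetical, and it assumes that a kernel's throughput scales linearly with the number of APAs it occupies.

```python
from typing import List


def allocate_apas(ops_per_record: List[int], total_apas: int) -> List[int]:
    """Greedily share APAs among synchronous kernels to balance their speeds.

    ops_per_record[i] is the number of arithmetic operations kernel i performs
    per stream record. Every kernel gets one APA; each remaining APA goes to
    the current bottleneck, i.e. the kernel with the most work per APA.
    """
    if total_apas < len(ops_per_record):
        raise ValueError("need at least one APA per synchronous kernel")
    allocation = [1] * len(ops_per_record)
    for _ in range(total_apas - len(ops_per_record)):
        bottleneck = max(range(len(allocation)),
                         key=lambda i: ops_per_record[i] / allocation[i])
        allocation[bottleneck] += 1
    return allocation
```

For example, with a producer needing 100 operations per record, a consumer needing 300, and four APAs, the sketch assigns one APA to the producer and three to the consumer, so both handle a record in the same time under the linear-scaling assumption.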



5 Stream Application Studies - Fluid Compute 

Stream applications are defined as applications that can be mapped to the stream 
programming model. They are concentrated in computation-intensive domains, including 
scientific computation, graphics rendering, media processing and so on. The features 
of stream applications are clear. First, the computation is intensive and must finish 
within a given time. Second, the data has little reuse and weak dependence. Third, 
the program is easy to divide into modules. Fluid computation is a typical 
scientific computation that is applied in many important domains, such as aerospace, 
space flight and machine manufacturing. This paper takes the numerical simulation of 
complex steady flow in a hypersonic free stream as a stream application to map to 
MASA. We analyze the kernel decomposition and the intermediate results, and also 
compare MASA with some other architectures. 

In terms of LU-SGS, the whole computation has three steps. First, input the 
parameters and choose the space and time schemes. Second, calculate the values of 
the basic governing equations. Third, advance in time and iterate to solve the 
equations. The second and third steps are repeated until the result converges [17]. 
Figure 3(a) diagrams part of the LU-SGS algorithm. 

The program has the following characteristics: 

• The data exhibits locality and the computation advances in a single direction. 
The value at each point needs information from only two or three neighboring 
points, so the computation can be parallelized across several points. 

• The dependence distance between iterations is very small, because each iteration 
needs only the results of the latest one to three iterations. It is convenient to 
exploit kernel-level parallelism by loop unrolling or software pipelining. 

• There is no dependence between the flux computations, so they can be executed in 
parallel as multiple tasks. The functions in the program are easy to divide into 
blocks according to their loops. As a result, they can be mapped onto several 
kernels directly. 

• The computation is intensive. The application requires many arithmetic operations 
per memory reference. 

• There are few conditional branches, and these may be converted to conditional 
instructions. 

These characteristics of the computation and data match MASA. 
Fig. 3. (a) Part of the LU-SGS algorithm (b) Complex steady flow computing 



The main part of the computation is to solve the partial differential equation 

$$\frac{\partial U}{\partial t} + \frac{\partial E}{\partial x} + \frac{\partial F}{\partial y} + \frac{\partial G}{\partial z} = \frac{\partial E_v}{\partial x} + \frac{\partial F_v}{\partial y} + \frac{\partial G_v}{\partial z}$$ 

to get six fluxes, where $U$ is the conservative variable and $E$, $F$, $G$, $E_v$, 
$F_v$, $G_v$ are the inviscid and viscous fluxes in the $x$, $y$ and $z$ directions. 
Computation of the three inviscid fluxes is the most intensive. Figure 3(b) shows in 
brief how we map the solution of the partial differential equation to the stream 
programming model. There, $\tau$ and $q$ denote the stress tensor and heat flux 
respectively, and $H$ is the enthalpy per unit mass. $\rho$, $p$, $u$, $v$, $w$, $T$ 
denote density, pressure, the three components of the velocity and temperature 
respectively. $e$, $E$, $H$, $\mu$, $k$ denote energy, enthalpy and coefficients of 
viscosity respectively. $Re$, $Ma_\infty$, $\gamma$, $Pr$ denote the Reynolds number, 
Mach number, specific heat ratio and Prandtl number respectively. 

With 300 thousand points as input data, the computation of the three inviscid fluxes 
takes 1.5 to 2 billion single-precision floating-point operations per iteration, 
mainly additions and multiplications. The total computation per iteration is about 
2.5 to 3.5 billion operations. The test program is small in scale; it needs at least 
10000 steps to reach a converged result. Assuming 15000 steps, the total number of 
arithmetic operations will be 4500 billion. Run on a Pentium 4, the program takes 
more than twenty hours and uses about 300 MB of memory. 




Fig. 4. Stream-Based inviscid flux computing 



Figure 4 shows how we map the computation of an inviscid flux to the stream 
programming model. The processing is composed of eight computation kernels that 
operate on data streams in succession. In the figure, input data is fetched from 
memory and output data is stored to memory; a rectangle represents a stream in the 
SRF, and a grid cell represents a stream element. Temporary data required by 
arithmetic operations is stored in the LRF. For example, kernel3 receives 2 input 
streams and produces 1 output stream, performing 205 arithmetic operations. 

Table 1 compares the memory, global register and local register bandwidth 
requirements of a stream architecture (MASA) with those of a vector processor and a 
scalar processor for kernel3². 

The left-most column of Table 1 gives the number of memory references, SRF 
references and LRF references for the stream architecture. Over the entire pipeline, 
the stream architecture performs 35 memory references, as shown in Figure 4. 
Amortizing this across the eight kernels gives 4.375 memory references per kernel. 
The total number of SRF references for kernel3 is 59, comprising 9 words read from 
the SRF and 50 words written to the SRF. Each node uses 615 words of LRF bandwidth 
to carry out the kernel, which requires 205 arithmetic operations. The additional 
LRF accesses beyond the 615 come from register accesses that initially store data 
from the SRF into a local register file, and from register transfers required for 
other kernels. 

2 This number is set by 8 ALUs in an AP of MASA. 



Table 1. 32-bit references per point for kernels 

             Stream    Vector       Scalar 
Memory       4.375     59 (13.5)    288 (65.8) 
Global RF    59        552 (9.3)    1299 (22.1) 
Local RF     674       N/A          N/A 



The next two columns present the number of references for the same kernel on a 
vector processor with an organization similar to the MASA processor, together with 
the ratio to the stream processor (in parentheses). Vector operations are primitive 
arithmetic operations rather than compound stream operations, and the vector 
processor uses the global (vector) register file rather than local register files as 
the source and sink of data for the arithmetic units. Taking this and software 
pipelining into account, the vector processor requires 13.5 times as much memory 
bandwidth and 9.3 times as much global register file bandwidth as a stream processor. 

The last two columns of Table 1 compare these numbers to a scalar processor by 
giving both the absolute number of references and the ratio to a stream processor (in 
parentheses). The scalar numbers were generated by compiling kernel3 for MIPS 
using version 3.0.2 of the gcc compiler. The scalar processor requires 65.8 times as 
much memory bandwidth and 22.1 times as much global register file bandwidth as a 
stream processor. The number of memory references for the stream architecture is an 
average, and kernel3 runs in the middle of the pipeline, so some of the data it 
requires is already in the SRF and need not be fetched from memory. The number of 
memory references for the scalar architecture is therefore relatively magnified in 
this comparison. Even so, the difference is still great. A cache in the scalar 
processor helps to narrow the gap, but there is little data reuse in stream 
applications, so the benefit is limited. 
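
The figures in Table 1 can be cross-checked with a few lines of Python. The sketch below only re-derives numbers quoted in the text. The reading of 615 as three LRF words per operation is our interpretation (615 / 205 = 3), not something the text states, and the variable names are illustrative.

```python
# Figures quoted in the text and in Table 1 (32-bit words per point).
PIPELINE_MEMORY_REFS = 35
NUM_KERNELS = 8
SRF_READS, SRF_WRITES = 9, 50
KERNEL3_OPS = 205
LRF_TOTAL = 674
VECTOR_MEMORY, VECTOR_GLOBAL_RF = 59, 552
SCALAR_MEMORY, SCALAR_GLOBAL_RF = 288, 1299

# Stream column.
stream_memory = PIPELINE_MEMORY_REFS / NUM_KERNELS   # amortized over the kernels
stream_srf = SRF_READS + SRF_WRITES                  # global RF (SRF) references
lrf_for_ops = 3 * KERNEL3_OPS                        # assumed: 3 LRF words per operation
lrf_extra = LRF_TOTAL - lrf_for_ops                  # LRF accesses beyond the 615

# Ratios of the vector and scalar processors to the stream processor.
ratios = {
    "vector memory": VECTOR_MEMORY / stream_memory,
    "scalar memory": SCALAR_MEMORY / stream_memory,
    "vector global RF": VECTOR_GLOBAL_RF / stream_srf,
    "scalar global RF": SCALAR_GLOBAL_RF / stream_srf,
}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

The memory ratios round to the 13.5 and 65.8 given in Table 1, and the global register file ratios come out close to the quoted 9.3 and 22.1.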

MASA exploits spatial parallelism, in which multiple kernels run concurrently. The 
advantages of this feature can be appreciated by comparing MASA with other typical 
stream processors such as Imagine. Of course, the advantage comes at the cost of 
hardware complexity. 

When the SRF is full of intermediate results, the running kernel has to be switched 
to another kernel that consumes the intermediate results. So each time a kernel runs, 
the number of records it can process is limited. We call this number the kernel 
executing granularity (KEG). The KEG is limited by the size of the intermediate 
results and of the SRF. Table 2 compares the intermediate result size per node, the 
maximum KEG and the SRF requirement ratio of MASA with those of a typical stream 
processor for inviscid flux computing. We analyze the intermediate results of each 
fluid node. The two processors are assumed to have the same computing ability and 
SRF size, though MASA’s distributed SRF differs from Imagine’s central SRF in 
bandwidth and utilization. In MASA, data passed between kernels running at the same 
time is transferred and consumed immediately, so this procedure generates smaller 
intermediate results. Smaller intermediate results effectively increase the kernel 
executing granularity and the utilization ratio and decrease the cost of switching 
kernels. They also let a kernel operate on a longer stream, which relieves the short 
stream problem. 
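
The relation between SRF size, intermediate result size and KEG can be made concrete with a small Python sketch. It assumes the simplest possible model, in which the maximum KEG is the SRF capacity divided by the intermediate result size per record. The function name and the numbers in the usage lines are hypothetical and are not taken from Table 2.

```python
def max_kernel_executing_granularity(srf_words: int,
                                     intermediate_words_per_record: int) -> int:
    """Upper bound on the records a kernel can process before the SRF fills up.

    Assumes every processed record leaves a fixed number of intermediate words
    in the SRF and that the whole SRF is available for them.
    """
    if intermediate_words_per_record <= 0:
        raise ValueError("intermediate result size must be positive")
    return srf_words // intermediate_words_per_record


# Hypothetical sizes: the same SRF, with large and with small intermediate results.
keg_large_intermediate = max_kernel_executing_granularity(32768, 50)
keg_small_intermediate = max_kernel_executing_granularity(32768, 10)
```

Under this model, shrinking the intermediate result per record by a factor of five raises the maximum KEG by roughly the same factor, which is the effect described above.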




